Text File | 1992-01-23 | 542.4 KB | 17,398 lines
Log-Number: 30568
From: mendel (Mendel Rosenblum)
Subject: Re: SIGHUP problems
Date: Wed, 02 Jan 91 11:18:54 PST

> Return-Path: tve
> Received: by sprite.Berkeley.EDU (5.59/1.29)
>     id AA210742; Fri, 28 Dec 90 18:07:17 PST
> Date: Fri, 28 Dec 90 18:07:17 PST
> From: tve (Thorsten von Eicken)
> Message-Id: <9012290207.AA210742@sprite.Berkeley.EDU>
> To: bugs
> Subject: SIGHUP problems
>
> It seems that it is either impossible to send or to receive a SIGHUP.
> Here's an example program which prints "interrupt" if it receives a SIGINT
> and "hangup" if it receives a SIGHUP. Following that a little log of my
> attempts to get it to print "hangup", but it always prints "interrupt".
> I tried the exact same thing on sunOS and it behaves as expected.
>

The problem here is caused by the mapping of unix signals into sprite
signals. The unix compatibility library maps SIGHUP into the sprite
signal SIG_INTERRUPT. SIG_INTERRUPT is mapped back to SIGINT. This
problem will be fixed when we convert the kernel to use unix signal
numbers. In the meantime, the current mapping of signals can be found
in the file /sprite/src/lib/c/unixSyscall/compatSig.h

Mendel

Log-Number: 30569
Subject: MIP C compiler info
Date: Wed, 02 Jan 91 15:18:39 PST
From: Mike Kupfer <kupfer>

[This is to get it into the bug log for discussion at a Monday meeting.
-mdk]

------- Forwarded Message

Date: Mon, 31 Dec 90 12:58:01 PST
>From: sethg (Seth Copen Goldstein)
To: root@sprite.Berkeley.EDU
Subject: how do I get the man page for mips c compiler?

------- End of Forwarded Message

Log-Number: 30571
Subject: Re: 'mig rup' doesn't work
Date: Wed, 02 Jan 91 20:51:38 PST
From: Mike Kupfer <kupfer>

> Date: Wed, 2 Jan 91 19:24:10 PST
> From: dlong@dogwood.ucsc.edu (Dean Long)
> To: sprite@sprite.Berkeley.EDU
> Subject: 'mig rup' doesn't work
>
> Keywords: mig rup shell script
>
> 'mig rup' doesn't work, because mig uses the Bourne shell to run
> rup since it doesn't have an a.out header. Maybe rup should
> have '#! /sprite/cmds/csh' as the first line. Or maybe mig should
> try feeding the shell script to the user's default shell.
>
> dl

Thanks for the bug report. FYI, please send bug reports to
"bugs@sprite.berkeley.edu", not "sprite@sprite.berkeley.edu". The bug
report script (the one that generates the report that we review every
week) only looks at messages that are sent to "bugs".

thanks,
Mike Kupfer

Log-Number: 30577
Date: Thu, 3 Jan 91 20:48:33 PST
From: dlong@dogwood.ucsc.edu (Dean Long)
Subject: dd -bs doesn't work

When trying to copy one disk to another, the following gave me errors:

    rsh mach dd if=disk0 bs=bsize | dd of=disk1 bs=bsize

while the following worked fine:

    rsh mach dd if=disk0 ibs=bsize obs=bsize | dd of=disk1 ibs=bsize obs=bsize

I thought bs should set ibs and obs, i.e. the two commands above should
be equivalent.

dl

Log-Number: 30579
From: mendel (Mendel Rosenblum)
Subject: Bug in Bit_ library routines
Date: Fri, 04 Jan 91 10:48:23 PST

The library routine Bit_FindFirstClear() doesn't work correctly if the
number of bits is not a multiple of 32. Rather than return (-1) for "no
bits found cleared" it returns the first cleared bit in the leftover
bits in the last word. For example:

    int *bitMapArray;
    Bit_Alloc(20, bitMapArray);
    for (i = 0; i < 20; i++) {
        Bit_Set(i, bitMapArray);
    }
    Bit_FindFirstClear(20, bitMapArray)

most likely returns 20 and not -1. Since Bit_Alloc() uses malloc()
which doesn't zero the memory it returns, this same bug will sometimes
hit Bit_FindFirstSet().

Do we use the Bit routines for anything important?

Mendel

Log-Number: 30583
Date: Sun, 6 Jan 91 13:09:16 PST
From: bsw!adam@uunet.UU.NET (Adam de Boor)
Subject: "Time" typedef conflict between Sprite and X

Actually, the conflict with "Time" has been there since I first ported
X11 to Sprite. John O's position then was there would always be
conflicts like that and we couldn't go changing Sprite every time one
came up.
The solution he suggested (which I used) was to

    #define Time SpriteTime
    #include <file-that-defines-sprite's-Time>
    #undef Time

then use "SpriteTime" where sprite's version of Time was required. One
can also do it the other way (rename X's Time), of course.

Just a bit of history from an historical person...

a

Log-Number: 30584
Date: Mon, 7 Jan 91 09:06:43 PST
From: ouster (John Ousterhout)
Subject: Allspice crash

Allspice was dead when I came in this morning. The message on the
console was something like "MachHandleTrap: the error was in a kernel
process...." or something like that. I used kgcore to make a core dump,
which I left on Ginger in /home/ginger/raid/cores/allspice.crash.1-7

-John-

Log-Number: 30585
Date: Mon, 7 Jan 91 10:05:37 PST
From: ouster (John Ousterhout)
Subject: Another Allspice crash

Allspice crashed again about 45 minutes after the first reboot.
The message was "MachPageFault: page fault in kernel process: pc = 0x8".
I made another core dump, in /home/ginger/raid/cores/allspice.crash.1-7b.

Two crashes in a row gives me a bad feeling (why not 3 or 4?). Can
someone take a look at these core dumps ASAP to make sure that there isn't
a persistent problem that's going to cause continuous crashes every
45 minutes?

-John-

Log-Number: 30586
Date: Mon, 7 Jan 91 10:11:50 PST
From: ouster (John Ousterhout)
Subject: Bad magic number in core files?

I tried to run Kgdb on the 1.-7b core file generated today, but I
got the following message:

"/home/ginger/raid/cores/allspice.crash.1-7b" does not appear to be a core dump file (magic 0xf6006020, expected 0x80456)

The exact sequence of commands I used was:

    cd /home/ginger/sprite/kernels
    Gdb sun4.1.079
    core /home/ginger/raid/cores/allspice.crash.1-7b

Log-Number: 30587
From: mendel (Mendel Rosenblum)
Subject: Re: Bad magic number in core files?
Date: Mon, 07 Jan 91 12:06:47 PST

> I tried to run Kgdb on the 1.-7b core file generated today, but I
> got the following message:
>
> "/home/ginger/raid/cores/allspice.crash.1-7b" does not appear to be a core dump file (magic 0xf6006020, expected 0x80456)
>
> The exact sequence of commands I used was:
>
> cd /home/ginger/sprite/kernels
> Gdb sun4.1.079
> core /home/ginger/raid/cores/allspice.crash.1-7b

The problem here is that I haven't gotten kgdb.sun4.new to compile
under Unix yet. I was able to use these core files by using
kgdb.sun4.new on Sprite.

Mendel

Log-Number: 30588
From: mendel (Mendel Rosenblum)
Subject: Re: Another Allspice crash
Date: Mon, 07 Jan 91 12:52:19 PST

> Allspice crashed again about 45 minutes after the first reboot.
> The message was "MachPageFault: page fault in kernel process: pc = 0x8".
> I made another core dump, in /home/ginger/raid/cores/allspice.crash.1-7b.
>
> Two crashes in a row gives me a bad feeling (why not 3 or 4?). Can
> someone take a look at these core dumps ASAP to make sure that there isn't
> a persistent problem that's going to cause continuous crashes every
> 45 minutes?
> -John-

The crash was caused by a poison packet from a ds5000 (loiter). The
mousetrap that was added to catch the problem only checked for a bogus
value less than zero. The bogus value this time was 85, a value much
greater than the number of elements in the array being indexed. Anyway,
these mousetraps won't catch the problem because it appears to be in
the rpc or net modules.

The problem is that the RPC doesn't contain any parameter data
(rpcHdr->paramSize == 0). Since the Reopen RPC stub doesn't check this
it ends up using garbage from previous RPC as the arguments to the
reopen procedures.
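
The missing check described above could look roughly like the following
C sketch. It is illustrative only: the header is cut down to the fields
discussed, and RpcHdrSketch, ReopenParamsSketch, ReopenStubSketch, and
the return codes are hypothetical names, not the Sprite kernel's actual
declarations.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down RPC header: only the fields discussed above. */
typedef struct {
    int command;
    int paramSize;   /* bytes of parameter data carried by the message */
} RpcHdrSketch;

/* Hypothetical parameter block that a reopen stub expects to receive. */
typedef struct {
    int fileID;
    int flags;
} ReopenParamsSketch;

enum { STUB_OK = 0, STUB_BAD_PARAMS = -1 };

/*
 * Copy the parameters out of the message only if the header says that
 * enough bytes were sent. Otherwise reject the request instead of
 * reusing whatever an earlier RPC left in the buffer.
 */
int
ReopenStubSketch(const RpcHdrSketch *hdr, const void *paramBuf,
                 ReopenParamsSketch *out)
{
    if (hdr == NULL || paramBuf == NULL || out == NULL) {
        return STUB_BAD_PARAMS;
    }
    if (hdr->paramSize < 0 ||
        (size_t) hdr->paramSize < sizeof(ReopenParamsSketch)) {
        return STUB_BAD_PARAMS;
    }
    memcpy(out, paramBuf, sizeof(ReopenParamsSketch));
    return STUB_OK;
}
```

With a header like the one in this crash (paramSize == 0), such a stub
would return STUB_BAD_PARAMS rather than call the reopen procedures
with stale arguments.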

The header from the RPC looks like:

$16 = {
  version = 252575747,
  flags = 532,
  clientID = 83,
  serverID = 14,
  channel = 3,
  serverHint = 6,
  bootID = 663232769,
  ID = 42063,
  delay = 500,
  numFrags = 0,
  fragMask = 0,
  command = 32,
  paramSize = 0,
  dataSize = 0,
  paramOffset = 0,
  dataOffset = 0
}

Either the paramSize or the command is incorrect. A better mousetrap
for allspice should be to check the paramSize in the RPC stubs for
Fsio_Reopen.

Mendel

Log-Number: 30589
Date: Mon, 7 Jan 91 18:02:32 PST
From: shirriff (Ken Shirriff)
Subject: CC man page

Where is the cc man page defined? I can't find cc.man anywhere in the
source tree, but only in /sprite/man/cmds/cc.man . Shouldn't this be
in /sprite/src/cmds/cc?

Ken

Log-Number: 30590
Subject: allspice crash, busy block problem
Date: Tue, 08 Jan 91 12:59:04 PST
From: Mike Kupfer <kupfer>

Allspice crashed with a level 15 interrupt shortly before noon. It
died before I could reboot it to fix a problem that had wedged Emacs.
This other problem was that a buffer had been marked as "busy waiting
for I/O to complete" while there was apparently nobody actually doing
I/O on that buffer.

mike

Log-Number: 30591
Date: Tue, 8 Jan 91 13:16:05 PST
From: Darrell Long <darrell@sequoia.ucsc.edu>
Subject: "more" and NFS

I'll send more details as they become apparent, but using 1.079 on a
Sun 4c, "more /nfs/file" loses the first few (about 80) characters when
run on a client, but works on the server. Non-NFS files are OK. Other
programs such as "vi" and "head" are also OK. "mig more /nfs/file" is
also OK.

DL

Log-Number: 30592
Subject: more failed compilations
Date: Tue, 08 Jan 91 13:27:55 PST
From: Mike Kupfer <kupfer>

I can't compile netroute or rpchist. Anyone know what happened?

mike
--
(netroute)
cc -g -O -msun4 -Dsprite -Dsun4 -I. -Isun4.md -c netroute.c -o sun4.md/netroute.o
netroute.c: In function main:
netroute.c:150: `NET_ROUTE_ETHER' undeclared (first use this function)
netroute.c:150: (Each undeclared identifier is reported only once
netroute.c:150: for each function it appears in.)
netroute.c:171: `NET_ROUTE_INET' undeclared (first use this function)
netroute.c:210: `NetInetRoute' undeclared (first use this function)
netroute.c:210: parse error before `inetRoute'
netroute.c:226: `inetRoute' undeclared (first use this function)
netroute.c:256: parse error before `inetRoute'
netroute.c:347: parse error before `inetRoute'

(rpchist)
cc -g -O -msun4 -Dsprite -Dsun4 -I. -Isun4.md -c rpchist.c -o sun4.md/rpchist.o
rpchist.c: In function PrintCommand:
rpchist.c:393: `RPC_PROC_MIG_INIT' undeclared (first use this function)
rpchist.c:393: (Each undeclared identifier is reported only once
rpchist.c:393: for each function it appears in.)
rpchist.c:396: `RPC_PROC_MIG_INFO' undeclared (first use this function)
rpchist.c:441: `RPC_FS_DEV_REOPEN' undeclared (first use this function)

Log-Number: 30593
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Tue, 8 Jan 1991 13:36:10 PST
Subject: Re: more failed compilations

Netroute is an obsolete program that has been replaced by netroute.new.
Netroute can go away (and netroute.new renamed) as soon as any kernels
older than 1.078 are gone. Is there any pressing need for it to be
recompiled?

John

Log-Number: 30595
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Tue, 8 Jan 1991 16:39:14 PST
Subject: lseek fixed

I fixed a bug that caused lseek to not work for pseudo-filesystems.
This caused Bob problems with msgs, and Darrell problems with more.
The problem should go away in the next kernel.

John

Log-Number: 30599
Subject: cpp appends space in .Xdefaults?
Date: Wed, 09 Jan 91 11:42:55 PST
From: Mike Kupfer <kupfer>

When I was at Olivetti I used cpp to define my default font and window
border width (because I had two different displays, one at 75 dpi and
one at 102 dpi). When I got to Sprite, I had to hand-copy all the font
definitions because somebody (cpp?) was appending a blank to the font
specifications, which was confusing Emacs. It turns out that the
problem is not limited to strings--I had problems running JO's Tk demo
because the BorderWidth was registered as "2 " instead of 2.

mike
--
(begin test case)
#define STD_FONT *courier-medium-r-normal--*-120*

#ifdef USE_MACRO
emacs*font: STD_FONT
#else
emacs*font: *courier-medium-r-normal--*-120*
#endif
(end test case)

Here's a script that demonstrates the problem.

(begin script)
sage% xrdb -query | grep emacs
sage% xrdb -load Xfoo
sage% xrdb -query | grep emacs
emacs*font: *courier-medium-r-normal--*-120*
sage% emacs
sage% xrdb -load Xfoo -DUSE_MACRO
sage% xrdb -query | grep emacs
emacs*font: *courier-medium-r-normal--*-120*
sage% emacs
emacs: X server unable to find requested font `*courier-medium-r-normal--*-120* '.
sage%
(end script)

Log-Number: 30600
Date: Wed, 9 Jan 91 11:55:35 PST
From: shirriff (Ken Shirriff)
Subject: Re: cpp appends space in .Xdefaults?

On the decstation, cc uses /usr/lib/cpp1.31 (Ultrix), which works
correctly. (works correctly == no extra space)
It's only the GCC cpp which doesn't work correctly. If you give GCC cpp
the -traditional flag, it works correctly.

So if we use cpp -traditional for Xdefaults, etc. then everything
should be ok.

Ken

Log-Number: 30601
Date: Thu, 10 Jan 91 10:05:22 PST
From: tve (Thorsten von Eicken)
Subject: processes in th debugger on assault

  PID  STATE  TIME  COMMAND
3190c  DEBUG  1:54  nfsmount ginger:/var/spool/msgs /sprite/spool/msgs
21924  DEBUG  0:02  /sprite/daemons/portmap
11956  DEBUG  0:00  send-mail -i -m rothman@chowder

Log-Number: 30602
From: mendel (Mendel Rosenblum)
Subject: Gcc1.37.1 missing warning message
Date: Thu, 10 Jan 91 11:30:36 PST

Program: Gcc
Version: 1.37.1
Machine type: Sparc (sun4)
Options: -Wall -O
Priority: Low

The man page for gcc says that the "-W" option combined with the "-O"
options should print an extra warning message if an automatic variable
is used without being initialized. This doesn't work for the following
program:

void
foobar()
{
    extern void foo2 (int **);
    extern int svar;
    int *uninitPtr;

    svar = *uninitPtr;  /* Dereference an uninitialized auto variable. */
    foo2(&uninitPtr);   /* If this line is present then gcc 1.37.1 with
                         * the -Wall -O options doesn't generate a
                         * warning message for the above line. */
}

The problem appears to be related to the passing by reference of
uninitPtr after the access in error. If the call to foo2() is removed
or replaced with some other initialization such as
"uninitPtr = (int *)0", the correct warning message is generated.

Mendel

Log-Number: 30603
Date: Thu, 10 Jan 91 13:03:07 PST
From: shirriff (Ken Shirriff)
Subject: Xcfbpmax broken

/X11/R4/cmds/Xcfbpmax was recompiled yesterday and now it no longer
works correctly. I can't access the syslog inside X because I get
/dev/syslog: Invalid argument. This happens on old and new ds3100
kernels.

Log-Number: 30607
Date: Thu, 10 Jan 91 20:34:29 PST
From: shirriff (Ken Shirriff)
Subject: Re: Xcfbpmax broken

I fixed Xmfbpmax so the syslog works properly. The problem was it does
this:

    if (!fopen(logfile,"a+")) then fopen("/dev/null") as log file.

The logfile was recently changed from /usr/adm/Xmsgs to /dev/syslog.
Note that "a+" opens for reading and writing.
Thus, if you are already reading the syslog when this executes, it logs
to /dev/null. However, if you aren't reading the syslog, Xmfbpmax
latches onto it and then you can't read it.

I changed Xmfbpmax to open with "a" and now it seems to work fine.

Ken

Log-Number: 30605
Date: Thu, 10 Jan 91 16:00:07 PST
From: elm@king.Berkeley.EDU (ethan miller)
Subject: problems w/Cory sprite

All of a sudden (as of about 2 minutes before this e-mail was sent, at
about 4:00 on 1/10/91), chisum started giving out lots of

    LE ethernet: Missed a packet

messages. They were coming one per line (ie, I'd type a line, and I'd
get a message after I hit return). We're running 1.075 over here.
Could that have something to do with it? I know it's not a load
problem; currently, the machine isn't running X (only the Mail program)
and it's still happening.

thanks
ethan

Log-Number: 30608
Date: Fri, 11 Jan 91 09:20:18 PST
From: tve (Thorsten von Eicken)
Subject: assault ipserver dead?

can't login and some (all?) nfs filesystems give "I/O error"

Log-Number: 30609
From: mendel (Mendel Rosenblum)
Subject: Re: assault ipserver dead?
Date: Fri, 11 Jan 91 10:22:47 PST

The IpServer along with the inetd, sendmail, and were missing. I
killed off the nfsmounts and executed restartServers.

Mendel

Log-Number: 30611
Date: Fri, 11 Jan 91 18:45:18 PST
From: dlong@dogwood.ucsc.edu (Dean Long)
Subject: bug in fsattach

There is a bug in fsattach. It only produces the
/hosts/$HOST/rsdXXX.fsc file for the first partition in the mount file.
The reason is in function MoveOutput in
/sprite/src/admin/fsattach/misc.c. The variable i in the outside loop
is being reused for an inner loop. The following is a context diff to
fix it (for misc.c 1.9).

dl

-----------------------chop chop here---------------------
*** misc.c.old Mon Oct 22 09:29:03 1990
--- misc.c Fri Jan 11 18:35:53 1991
***************
*** 419,425 ****
      int bytesWritten;
      int bytesToWrite;
      char buffer[1024];
!     int i;
      Boolean done;
      char *hostName;

--- 419,425 ----
      int bytesWritten;
      int bytesToWrite;
      char buffer[1024];
!     int i, mountIndex;
      Boolean done;
      char *hostName;

***************
*** 427,449 ****
          printf("Moving output from fscheck.\n");
      }
      hostName = getenv("HOST");
!     for(i = 0; i < mountCount; i++) {
          if (debug) {
!             printf("%d (%s): device = %s, status = %s\n", i,
!                 mountTable[i].source,
!                 (mountTable[i].device == TRUE ? "true" : "false"),
!                 (mountTable[i].status == CHILD_OK) ? "ok" : "not ok");
          }
!         if (mountTable[i].checked == FALSE ||
!             mountTable[i].status != CHILD_OK ||
!             mountTable[i].device == FALSE) {
              continue;
          }
          (void) sprintf(outputFile, "/hosts/%s/%s.fsc", hostName,
!             mountTable[i].source);
          if (verbose) {
              printf("Copying output from checking %s to %s.\n",
!                 mountTable[i].source, outputFile);
          }
          outputStream = fopen(outputFile, "a+");
          if (outputStream == (FILE *)NULL) {
--- 427,449 ----
          printf("Moving output from fscheck.\n");
      }
      hostName = getenv("HOST");
!     for(mountIndex = 0; mountIndex < mountCount; mountIndex++) {
          if (debug) {
!             printf("%d (%s): device = %s, status = %s\n", mountIndex,
!                 mountTable[mountIndex].source,
!                 (mountTable[mountIndex].device == TRUE ? "true" : "false"),
!                 (mountTable[mountIndex].status == CHILD_OK) ? "ok" : "not ok");
          }
!         if (mountTable[mountIndex].checked == FALSE ||
!             mountTable[mountIndex].status != CHILD_OK ||
!             mountTable[mountIndex].device == FALSE) {
              continue;
          }
          (void) sprintf(outputFile, "/hosts/%s/%s.fsc", hostName,
!             mountTable[mountIndex].source);
          if (verbose) {
              printf("Copying output from checking %s to %s.\n",
!                 mountTable[mountIndex].source, outputFile);
          }
          outputStream = fopen(outputFile, "a+");
          if (outputStream == (FILE *)NULL) {
***************
*** 452,458 ****
              perror("");
              return;
          }
!         (void) sprintf(inputFile, "%s/%s", mountTable[i].dest, tempOutputFile);
          tempStream = fopen(inputFile,"r+");
          if (tempStream == (FILE *)NULL) {
              (void) fprintf(stderr, "%s: can't open \"%s\", ", progName,
--- 452,459 ----
              perror("");
              return;
          }
!         (void) sprintf(inputFile, "%s/%s", mountTable[mountIndex].dest,
!             tempOutputFile);
          tempStream = fopen(inputFile,"r+");
          if (tempStream == (FILE *)NULL) {
              (void) fprintf(stderr, "%s: can't open \"%s\", ", progName,

Log-Number: 30613
Date: Sun, 13 Jan 91 01:51:21 PST
From: dlong@dogwood.ucsc.edu (Dean Long)
Subject: fsmake -dir doesn't seem to work

If I use fsmake to make a filesystem, and give the -dir option for a
directory to copy, it prints out all the files it supposedly copied,
but when I mount the filesystem, the files aren't there, and fscheck
gives errors on the file system.

dl

Log-Number: 30614
Date: Sun, 13 Jan 91 16:40:49 PST
From: tve (Thorsten von Eicken)
Subject: IPserver on assault is dead again!

Log-Number: 30615
Subject: nfsmount botches store to full partition
Date: Sun, 13 Jan 91 23:54:44 PST
From: Mike Kupfer <kupfer>

Suppose I'm editing a file with Emacs and there's not enough room to
save the revised version. If the file is on a Sprite filesystem, when
I try to save the file, Emacs sits there until I get bored and hit ^G,
at which time it says there was an I/O error (presumably because the
fsync() failed). If the file is on an NFS partition, Emacs eventually
comes back and says that the file has been saved. This is a lie: only
as many bytes as would fit have actually been saved.

One can provoke similar behavior using cp or cat. If you overflow a
Sprite partition, you won't get a complaint, and "ls -l" will show the
sizes as though the copy had succeeded. If you overflow an NFS
partition, you still won't get a complaint, but "ls -l" will show the
copy as truncated down to whatever size would fit. The remaining bits
are apparently just dropped on the floor.
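
For comparison, a program that wants to notice a full partition has to
check every step of the save, roughly as in the C sketch below. It
assumes a POSIX-style system that actually reports the failure from
write(), fsync(), or close(), which is exactly what the report above
says does not happen here. SaveChecked is a hypothetical helper, not
code taken from Emacs, cp, or cat.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Close fd, set errno to err, and return -1 (error path helper). */
static int
FailWith(int fd, int err)
{
    close(fd);
    errno = err;
    return -1;
}

/*
 * Hypothetical helper: write len bytes to path and report any failure.
 * Returns 0 on success, -1 on error (errno says why, e.g. ENOSPC).
 */
int
SaveChecked(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0) {
        return -1;
    }
    while (len > 0) {
        ssize_t n = write(fd, buf, len);

        if (n < 0 && errno == EINTR) {
            continue;                   /* interrupted, try again */
        }
        if (n < 0) {
            return FailWith(fd, errno);
        }
        if (n == 0) {
            return FailWith(fd, EIO);   /* no progress, give up */
        }
        buf += n;                       /* short write: send the rest */
        len -= (size_t) n;
    }
    if (fsync(fd) < 0) {                /* surface delayed-write errors */
        return FailWith(fd, errno);
    }
    return close(fd);                   /* close can report an error too */
}
```

A caller would treat any -1 as "the file was not saved" and keep the
old copy.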

mike

Log-Number: 30617
From: mendel (Mendel Rosenblum)
Subject: Re: nfsmount botches store to full partition
Date: Mon, 14 Jan 91 09:32:40 PST

> Return-Path: kupfer
> Received: by sprite.Berkeley.EDU (5.59/1.29)
>     id AA467295; Sun, 13 Jan 91 23:54:45 PST
> Message-Id: <9101140754.AA467295@sprite.Berkeley.EDU>
> To: bugs
> Subject: nfsmount botches store to full partition
> Date: Sun, 13 Jan 91 23:54:44 PST
> From: Mike Kupfer <kupfer>
>
> Suppose I'm editing a file with Emacs and there's not enough room to
> save the revised version. If the file is on a Sprite filesystem, when
> I try to save the file, Emacs sits there until I get bored and hit ^G,
> at which time it says there was an I/O error (presumably because the
> fsync() failed). If the file is on an NFS partition, Emacs eventually
> comes back and says that the file has been saved. This is a lie: only
> as many bytes as would fit have actually been saved.
>
> One can provoke similar behavior using cp or cat. If you overflow a
> Sprite partition, you won't get a complaint, and "ls -l" will show the
> sizes as though the copy had succeeded. If you overflow an NFS
> partition, you still won't get a complaint, but "ls -l" will show the
> copy as truncated down to whatever size would fit. The remaining bits
> are apparently just dropped on the floor.
>
> mike

I think the problem is that in order to get acceptable write
performance, Brent enabled write-behind on NFS pseudo-file system
files. This means that the nfsmount daemon acks the write before it
knows if there is room for the file on the remote file system. If it
can't write the blocks to disk it prints a message and tosses it.

Mendel

ps. Sorry about the blank message before this one. I pushed the "Send"
button instead of the "Insert" button in xmh.

Log-Number: 30618
Subject: xmh in infinite loop
Date: Mon, 14 Jan 91 11:37:36 PST
From: Mike Kupfer <kupfer>

I came back from making tea, only to find xmh in an infinite loop.
This is the stack backtrace from gdb. I'm not sure I entirely trust it,
since the xmh I was running didn't have symbols, so I used the xmh in
/X11/R4/src.

mike
--
#0 0x13084 in XawAsciiSourceChanged ()
#1 0x11d50 in XmhPrintView ()
#2 0x25204 in XawTextSourceRead ()
#3 0x1cc38 in _XawTextGetText ()
#4 0x1cc84 in _XawTextGetSTRING ()
#5 0x1eb58 in _SetSelection ()
#6 0x1fb58 in _XawTextSetSelection ()
#7 0x1fc58 in _XawTextAlterSelection ()
#8 0x21d20 in _XawTextZapSelection ()
#9 0x21e48 in _XawTextZapSelection ()
#10 0x4ddc0 in _XtMatchUsingDontCareMods ()
#11 0x36364 in XtWindowToWidget ()
#12 0x36a08 in _XtOnGrabList ()
#13 0x36aa8 in _XtOnGrabList ()
#14 0x5774 in main (...) (...)

Log-Number: 30619
Subject: FS bottleneck if server is down?
Date: Mon, 14 Jan 91 12:04:47 PST
From: Mike Kupfer <kupfer>

I'm editing a file on /scratch1, which is served by allspice. I save
it (from Emacs), and it sits there for some seconds. While I'm
waiting, I type "df /scratch1" to a shell. That hangs, too.
Eventually I see

    <domain info> 1/14/91 11:58:12 raid1 (77) RPC timed-out

in my syslog, after which the save and the df immediately complete.
Can someone tell me what's going on here and whether this is avoidable?

thanks,
mike

Log-Number: 30620
Date: Mon, 14 Jan 91 14:35:45 PST
From: bmiller (Bob Miller)
Subject: 'adduser' problem

Any idea why adduser is not working for me??? Here's what I get...

subversion:/user1/bmiller> adduser
Enter 1 if you already have a /etc/passwd file from another machine.
Enter 2 if you want to fetch an entry from the ucb data base.
Enter 3 if you want to enter the information interactively.
Enter q to quit
Please choose 1, 2 or 3: 2
Enter user's last name: Slater
Enter user's group: guest
lastname is Slater, group is guest
Is this correct? (y or n) y
Fetching passwd entry from database on thalm.
This will take a minute or two. Please be patient....
Permission denied.
Could't fetch entry from thalm
Make sure your machine is listed in /.rhosts
Cleaning up ...

Log-Number: 30621
Subject: Re: 'adduser' problem (/.rhosts on allspice)
Date: Mon, 14 Jan 91 14:48:15 PST
From: Mike Kupfer <kupfer>

"adduser" didn't work because we had removed /.rhosts, because we
couldn't remember why it was there. Oops. Well, now we know, and now
it should be in the Sprite log for future reference.

So, try again and let me know if it still doesn't work.

mike

Log-Number: 30622
Subject: serious printer lossage (whining)
Date: Mon, 14 Jan 91 21:53:17 PST
From: Mike Kupfer <kupfer>

You would think I could print a crummy 4-page, 76KB postscript file
(slides for a talk) on the Laserwriter in 608-2. And sometimes I can.
Often, though, the job gets dropped on the floor, even when I'm out of
the office, not doing anything on Sage. This is getting to be a real
pain. What ever happened to the work on making sure we don't run with
interrupts disabled for too long?

mike

Log-Number: 30623
Date: Mon, 14 Jan 91 23:55:58 PST
From: elm (ethan miller)
Subject: problem with most recent kernel

The newest kernel (1.081) does not work correctly on raid2 in Cory.
Specifically, it fails to mount any directories in Evans. I noticed
this because my development kernel uses 1.081 .o modules, and I had
problems. When I booted raid2 using the released 1.081, the problems
accessing /sprite/src and /scratch1 were still there. The message was
"Contacting server 14 for "/sprite/src" prefix" followed by
"/sprite/src: no such file or directory". The problem doesn't happen
when I boot using the most recent version "officially" sent to Cory
(1.075?).

ethan

Log-Number: 30624
Date: Tue, 15 Jan 91 02:11:37 PST
From: elm (ethan miller)
Subject: bug in ls -l

When I do an ls -l on a device with a large unit number (such as the
ultranet device in /dev/ultra0), the unit number is listed as 0 even
though it is actually 0x5000. This caused me lots of grief, and it
seems like it should be easy to fix.
This happens on a sun4 and a ds3100.

ethan

Log-Number: 30625
From: mendel (Mendel Rosenblum)
Subject: Re: bug in ls -l
Date: Tue, 15 Jan 91 09:30:47 PST

> Return-Path: elm
> Received: by sprite.Berkeley.EDU (5.59/1.29)
>     id AA332864; Tue, 15 Jan 91 02:11:37 PST
> Date: Tue, 15 Jan 91 02:11:37 PST
> From: elm (ethan miller)
> Message-Id: <9101151011.AA332864@sprite.Berkeley.EDU>
> To: bugs@sprite.Berkeley.EDU
> Subject: bug in ls -l
>
> When I do an ls -l on a device with a large unit number (such as the
> ultranet device in /dev/ultra0), the unit number is listed as 0 even
> though it is actually 0x5000. This caused me lots of grief, and it
> seems like it should be easy to fix. This happens on a sun4 and a
> ds3100.
>
> ethan

The problem comes in the mapping of Sprite attributes to Unix
attributes. In Sprite, device unit numbers are 16 bits while in Unix
they are only 8 bits. So (unsigned char) 0x5000 == 0x0. It would be
trivial to fix this but it would require recompiling the world (kernel
and all user programs). All existing binaries would break. You don't
know grief until you've tried to change the structure returned by stat.

Mendel

ps. You can use the "stat" command to print out the full 16 bits of the
device and unit numbers.

Log-Number: 30626
From: tve (Thorsten von Eicken)
Subject: Re: new dvips?
Date: Tue, 15 Jan 91 18:18:55 PST

I backed out the new version of dvips. It works fine on our laserwriter
II in 444 but doesn't on the old lw. I have no postscript knowledge and
can't fix it. I will keep installing new tex software in a private
area.

TvE

------ Forwarded message -----

Return-Path: msilva
Received: by sprite.Berkeley.EDU (5.59/1.29)
    id AA17479; Tue, 15 Jan 91 13:01:35 PST
Date: Tue, 15 Jan 91 13:01:35 PST
>From: msilva (Mario J. Silva)
Message-Id: <9101152101.AA17479@sprite.Berkeley.EDU>
To: tve
Subject: dvips?

There is something wrong with our printer in lw608-8.
Plain text files are printed ok, but dvi files just make the
LaserWriter blink for a while. Jobs don't come out. The same jobs are
printed ok when I send them to the ps printer.

This is the first time I try to print a dvi file since your
announcement of the new version of dvips, so.... Any clues?

thanks,
Mario.

Log-Number: 30627
Date: Wed, 16 Jan 91 00:08:56 PST
From: dlong@dogwood.ucsc.edu (Dean Long)
Subject: directory entry not word aligned on sun4?

What happens if the kernel comes across a corrupt directory, and one of
the recordLength fields is not word aligned? It seems like that would
cause it to crash on a sun4. I know that fscheck checks for this, but
I don't think the kernel does.

dl

Log-Number: 30628
Date: Thu, 17 Jan 91 04:02:02 PST
From: dlong (Dean Long)
Subject: fscheck, -initialPart, and bootblocks

I think the default partition to read the disk label should be the
partition being checked, not necessarily partition a. The reason is
fscheck will not find the domain header on partition c (for example) if
partition c has 0 sectors allocated for the boot program, and partition
a has 16. Fscheck will read the disk label from partition a. If it is
a Sun label, it has to search for the domain header (of partition a).
Then when it goes to get the domain header for partition c (the
partition being checked), it uses the disk label from partition a,
which says the domain header starts at sector 18, not 2.

Most people probably don't worry about this, because they let fsmake
allocate boot sectors for all their partitions, so the domain header is
in the same place relative to the beginning of all the partitions.

You can make fscheck work by specifying the -initialPart option, but
that would mean putting a line for each partition in the mount file,
instead of one line with "all" at the beginning.
I don't see why you would ever want to read a disk label from a
different partition than the one being checked, unless the label for
the partition being checked is corrupted.

dl

Log-Number: 30632
From: mendel (Mendel Rosenblum)
Subject: Anyone have any ideas on this
Date: Fri, 18 Jan 91 17:05:41 PST

------- Forwarded Message

Return-Path: bks@okeeffe.berkeley.edu
Received: from ginger.Berkeley.EDU by sprite.Berkeley.EDU (5.59/1.29)
    id AA134723; Wed, 16 Jan 91 15:51:39 PST
Received: from okeeffe.Berkeley.EDU by ginger.Berkeley.EDU (4.1/1.42)
    id AA00244; Wed, 16 Jan 91 15:51:34 PST
Received: by okeeffe.Berkeley.EDU (5.65/1.41)
    id AA00208; Wed, 16 Jan 91 15:50:28 -0800
Date: Wed, 16 Jan 91 15:50:28 -0800
>From: bks@okeeffe.berkeley.edu (Brian K. Shiratsuki)
Message-Id: <9101162350.AA00208@okeeffe.Berkeley.EDU>
To: mendel@ginger.Berkeley.EDU
Subject: filesystem problems and pseudouser sprite
Reply-To: bks@ucbarpa.berkeley.edu

mendel,

for the second time, the fsck program has found lots of unreferenced
files on the /home/ginger/sprite filesystem and belonging to sprite.
they were all symbolic links pointing to /, and they all appeared
between 0200 and 0230. any idea where these are coming from, or who
might know?

thanks,
brian

------- End of Forwarded Message

Log-Number: 30633
Subject: Re: Anyone have any ideas on this
Date: Fri, 18 Jan 91 17:22:40 PST
From: Mike Kupfer <kupfer>

Maybe they're related to the nightly rdist, which is fired off at 0200,
and which has been failing lately because /home/ginger/sprite is full
again?

mike

Log-Number: 30635
Date: Sat, 19 Jan 91 03:07:17 PST
From: dlong (Dean Long)
Subject: kernel memory leaks

I went through most of the filesystem directories, looking for possible
memory leaks, since prolonged disk activity causes our kernel to eat up
all of memory (and crash).
Here is what I found:

file  line allocated  line lost
fs/fsSysCall.c  71 74 80 86 762 765
fsconsist/fsconsistCache.c  2016 2074
fsprefix/fsprefixOps.c  2211 2221

"line allocated" is the line number where the memory was allocated, and
"line lost" is the line number of the (return) statement that causes
the non-freed memory to be lost, usually because of some sort of
failure.

dl

Log-Number: 30636
Date: Sun, 20 Jan 91 19:23:48 PST
From: dlong (Dean Long)
Subject: pmake, @ and -

Pmake doesn't work right for commands with white space between the @
(or -) and the command. It seems like something like

    @ echo foo

should be OK, at least in make-compatible mode.

dl

Log-Number: 30637
Date: Mon, 21 Jan 91 00:42:05 PST
From: dlong@dogwood.ucsc.edu (Dean Long)
Subject: bug in ping (or ipServer)

Ping expects the packet it receives to have an ip header. ipServer
does not include the ip header in the packet. Ping only works because
the byte in the packet that it examines for the ip header length is
zero. In other words, ping is just lucky. I think the bug is in ping,
not ipServer. ipServer is consistent: the packets that go out on the
socket have the same format as the packets that come in. Ping,
however, sends packets of one format, and expects a different format.

dl

Log-Number: 30640
Date: Tue, 22 Jan 91 13:14:49 PST
From: ss (Srinivasan Seshan)
Subject: terrorism died

error message:
    Fatal Error: CacheFileInvalidate, hashing error

I rebooted it. I found it in this state when I got into my office at
1:10PM on 1/22. Apparently, it had this problem around the same time
as allspice was rebooted.

ethan (using srini's account)

Log-Number: 30642
Subject: xmh died trying to display new mail
Date: Tue, 22 Jan 91 14:35:54 PST
From: Mike Kupfer <kupfer>

I deleted the last (highest-numbered) message in +inbox and did a
"commit". I then did a "new mail" and "view next message". This put
xmh into the debugger. I attached the process but couldn't get very
far.
XtMalloc doesn't call readv directly, at least as far as I can tell. (This may be caused by using a different xmh, because the one in /X11/R4/cmds.sun4 doesn't have symbols.) Also, perhaps it would help if Xt and Xaw were compiled with debugging turned on? mike -- MachPageFault: Bus error in user proc 12147, PC = 78dac, addr = 136e78 BR Reg 8080 #0 0x78dac in sigpause () #1 0x78b1c in readv () #2 0x2cc44 in XtMalloc () #3 0x12e78 in XawAsciiSourceChanged () #4 0x11cf4 in XmhPrintView () #5 0x32cc8 in XtInitializeWidgetClass () #6 0x33068 in XtInitializeWidgetClass () #7 0x33400 in _XtCreateWidget () #8 0x334b0 in _XtCreateWidget () #9 0x77e0 in CreateFileSource (...) (...) #10 0x63e4 in SetScrnNewMsg (...) (...) #11 0x66a4 in SetScrn (...) (...) #12 0x66cc in MsgSetScrn (...) (...) #13 0xd858 in NextAndPreviousView (...) (...) #14 0xd910 in DoNextView (...) (...) #15 0xd964 in XmhViewNextMessage (...) (...) #16 0x4ddc0 in _XtMatchUsingDontCareMods () #17 0x36364 in XtWindowToWidget () #18 0x36a08 in _XtOnGrabList () #19 0x36aa8 in _XtOnGrabList () #20 0x5774 in main (...) (...) Log-Number: 30643 Subject: copying text to xmh composition window causes infinite loop Date: Tue, 22 Jan 91 15:45:17 PST From: Mike Kupfer <kupfer> Copying the following text into an xmh composition window on sage puts xmh into an infinite loop. From slater@ucbarpa.Berkeley.EDU Wed Jan 16 17:45:38 1991 Date: Wed, 16 Jan 91 17:44:30 -0800 From: slater@ucbarpa.Berkeley.EDU (Mel Slater) To: bmiller@sprite.Berkeley.EDU Subject: rcp I occassionally need to transfer files from arpa to sprite and vice versa. "rcp" always gives me "permission denied". Is there any way around this? Mel. To reproduce it, there should be nothing in the body of the composition (i.e., nothing after the line of dashes). Copying it into the header, or copying it into a body that already has some text (even if it's just spaces) works okay. Copying the text one line at a time works okay. 
Copying the text into an xedit window works okay. gdb reports the following sample stack backtrace: #0 0x6cb9c in .div () #1 0x24c08 in _XawTextSetField () #2 0x24f78 in XawTextSinkMaxLines () #3 0x1d028 in _XawTextBuildLineTable () #4 0x201a0 in _XawTextShowPosition () #5 0x20290 in _XawTextExecuteUpdate () #6 0x20cec in XawTextSearch () #7 0x20edc in XawTextSearch () #8 0x4599c in XtDisownSelection () #9 0x45a70 in XtDisownSelection () #10 0x45b40 in XtGetSelectionValue () #11 0x210a0 in XawTextSearch () #12 0x210cc in XawTextSearch () #13 0x4e750 in _XtTranslateEvent () #14 0x365fc in XtWindowToWidget () #15 0x36cc0 in _XtOnGrabList () #16 0x36d5c in XtDispatchEvent () #17 0x5774 in main (...) (...) Further investigation shows that _XawTextShowPosition is the culprit, in particular the lines while (ctx->text.insertPos >= ctx->text.lt.info[lines].position) { if (ctx->text.lt.info[lines].position > ctx->text.lastPos) break; _XawTextBuildLineTable(ctx, ctx->text.lt.info[1].position, FALSE); } seem to be where the loop is. mike Log-Number: 30644 From: mendel (Mendel Rosenblum) Subject: Bug in Sys_Shutdown Date: Tue, 22 Jan 91 19:03:03 PST The large down time we had this morning was due to a bug in the code added to Sys_Shutdown() to sync the disks. The code should only sync the disk when the flags specified to. The reason is that fsattach reboots the system after checking the root disk. The addition of the writeback always code meant that problems on the root disk never got fixed. I've fixed this problem in the uninstalled sys module. Mendel Log-Number: 30650 Subject: migd wedged on pdev Date: Fri, 25 Jan 91 14:01:40 PST From: Mike Kupfer <kupfer> When I logged in last night "uptime" wasn't working on sage. Further investigation showed that different hosts were trying to become the global master but getting hung on the migd pseudo-device. I nuked /sprite/admin/migd/pdev and restarted migd on sage, making it the global master. 
Violence had been the master but apparently went into some infinite TLB fault loop earlier in the evening. Here's an excerpt from /sprite/admin/migd/global-log: 22a5e: Global daemon checkpoint: running on violence.Berkeley.EDU at Thu Jan 24 19:13:10 1991 22a5e: SaveCheckPoint - marking jaywalk down (curTime 664773190, updated 664773001). 22a5e: Global_HostDown(host=jaywalk(18), closed=0) called. 22a5e: SaveCheckPoint - checking terrorism foreign count. 22a5e: SaveCheckPoint - checking sabotage foreign count. 22a5e: PdevClose - daemon 24712 on host 71 exited 22a5e: Global_HostDown(host=lsisim(71), closed=1) called. 22a5e: PdevClose - daemon 1511c on host 81 exited 22a5e: Global_HostDown(host=hoot(81), closed=1) called. 22a5e: PdevClose - daemon 11d18 on host 29 exited 22a5e: Global_HostDown(host=sassafras(29), closed=1) called. 22a5e: PdevClose - daemon 40e1a on host 14 exited 22a5e: Global_HostDown(host=allspice(14), closed=1) called. 22a5e: Global_HostUp - sassafras pid 11d18 boot 663628415 version 16 maxProcs 1 22a5e: Global_HostUp - allspice pid 40e1a boot 664577163 version 1016 maxProcs 1 22a5e: PdevClose - daemon 13218 on host 50 exited 22a5e: Global_HostDown(host=subversion(50), closed=1) called. 22a5e: PdevClose - daemon 4192e on host 25 exited 22a5e: Global_HostDown(host=assault(25), closed=1) called. 22a5e: Global_HostUp - subversion pid 13218 boot 663610667 version 16 maxProcs 1 22a5e: Global_HostUp - assault pid 4192e boot 663744032 version 16 maxProcs 1 Global_Init - process 71f15 version 5 on host hijack.Berkeley.EDU: run at Thu Jan 24 19:15:15 1991 CreateGlobal - we are the global master, pid 71f15 71f15: Exiting: mismatch statting files: name inode <9615,-1>, version 45. 71f15: descriptor inode <9615,-1> version 44 (another migd running, or server changed file version). 
Global_Init - process d3c2e version 5 on host arson.Berkeley.EDU: run at Thu Jan 24 19:28:39 1991 d3c2e: MigPdev_OpenMaster: couldn't open "/sprite/admin/migd/pdev" (text file or pseudo-device busy) Global_Init - process e4435 version 5 on host sedition.Berkeley.EDU: run at Thu Jan 24 19:28:41 1991 e4435: MigPdev_OpenMaster: couldn't open "/sprite/admin/migd/pdev" (text file or pseudo-device busy) Global_Init - process 95148 version 5 on host hoot.Berkeley.EDU: run at Thu Jan 24 19:28:39 1991 95148: MigPdev_OpenMaster: couldn't open "/sprite/admin/migd/pdev" (text file or pseudo-device busy) Log-Number: 30651 Date: Fri, 25 Jan 91 16:07:40 PST From: tve (Thorsten von Eicken) Subject: ununderstandable error message [crackle registry] cp mkfile ../002-kes cp: Unable to set Sprite user-file-type of dest [crackle registry] Huh? Log-Number: 30652 Subject: Re: ununderstandable error message Date: Fri, 25 Jan 91 16:32:19 PST From: Mike Kupfer <kupfer> Files in Sprite have a type associated with them. cp wants to propagate the type, but it failed, and it's too stoopid to say why. (I will fix this shortly.) mike Log-Number: 30653 Subject: unused declarations in time.h Date: Sun, 27 Jan 91 22:21:10 PST From: Mike Kupfer <kupfer> There seem to be a bunch of declarations in time.h that aren't used and aren't backed up by actual code. In particular, clock_t doesn't seem to be used anywhere except time.h, and the functions clock(), difftime(), and strftime() don't seem to be defined anywhere. Does anyone know what these declarations are for? Are they just ideas that nobody ever got around to actually implementing? What's the difference between clock_t and time_t? mike Log-Number: 30654 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Mon, 28 Jan 1991 12:27:08 PST Subject: directory rename Mendel and I fixed a bug that caused a handle to be released twice. 
This would happen if you tried to rename a directory, but the new name was a directory that already existed and wasn't empty. This doesn't normally happen due to the semantics of mv but I managed to write a program that did it. The fix is in the uninstalled fslcl. John Log-Number: 30656 Date: Wed, 30 Jan 91 16:09:18 PST From: gibson (Garth Gibson) Subject: migration trap I was using parsley (3100) as a host for a background pmake of many simulations when parsley trapped to the debugger. The message was: Fatal Error: Assertion failed: file "vmMigrate.c", line 265 SPRITE VERSION 1.081 (ds3100) (10 Jan 91 14:39:34) PC 0x800da62c I have rebooted the machine. garth Log-Number: 30657 Subject: ipServer died: freed memory twice Date: Thu, 31 Jan 91 17:26:04 PST From: Mike Kupfer <kupfer> After allspice came back this afternoon, I found that the IP Server on sage had died. The message in /hosts/sage/ip.out was Mem_Free: storage block already free Config file: /sprite/daemons/ipServer.config version: IPS 5/15/88 Here's the backtrace and a couple data structures: (gdb) bt #0 0x19a18 in Sig_Send () #1 0x15c58 in panic () #2 0x18654 in free () #3 0xe550 in TCP_SocketDestroy () (tcpSock.c line 853) #4 0xeaec in TCPCloseConnection (sockPtr=(Sock_InfoPtr) 0x9c3d8, tcbPtr=(TCPControlBlock *) 0x9c330) (tcpSock.c line 1294) #5 0xda3c in TCP_SocketClose (sockPtr=(Sock_InfoPtr) 0x9c3d8) (tcpSock.c line 171) #6 0x7eb4 in Sock_Close (privPtr=(struct Sock_PrivInfo *) 0x7b5d8) (sockOps.c line 409) #7 0x4fc8 in PdevRequestHandler (clientData=(ClientData) 0x7b5d8, streamID=31, eventMask=8) (main.c line 512) #8 0x144c4 in Fs_Dispatch () #9 0x4910 in main (argc=1, argv=(char **) 0x1dffff04) (main.c line 235) (gdb) up 3 #3 0xe550 in TCP_SocketDestroy () (tcpSock.c line 853) 853 free((char *) tcbPtr->templatePtr); (gdb) print *tcbPtr $2 = {reassList = {prevPtr = 0x9c330, nextPtr = 0x9c330}, templatePtr = 0xa7340, IPTemplatePtr = 0xa7360, connectPtr = 0x0, state = CLOSED, flags = 0, timer = {0, 
0, 0, 0}, rxtshift = 0, rxtcur = 2, dupAcks = 0, maxSegSize = 556, idle = 0, rtt = 0, srtt = 8, rttvar = 2, rtseq = 3139176862, urgentData = 0 '\000', urgentBufPos = 0, force = 0, send = {unAck = 3139176863, next = 3139176863, window = 4096, urgentPtr = 3139176863, updateSeqNum = 145216002, updateAckNum = 3139176863, initial = 3139176862, maxSent = 3139176863, congWindow = 4652, cwSizeThresh = 65535, maxWindow = 4096}, recv = {next = 145216029, window = 4068, urgentPtr = 145216002, initial = 145216000, advtWindow = 145220097, maxWindow = 8191}} (gdb) up #4 0xeaec in TCPCloseConnection (sockPtr=(Sock_InfoPtr) 0x9c3d8, tcbPtr=(TCPControlBlock *) 0x9c330) (tcpSock.c line 1294) 1294 TCP_SocketDestroy((ClientData) tcbPtr); (gdb) print *sockPtr $3 = {protoLinks = {prevPtr = 0x98d28, nextPtr = 0x98b30}, protoIndex = 2, protocol = 0, protoData = 0x9c330, reqBufSize = 4164, requestBuf = 0x87378 "\e", state = CONNECTED, options = 44, owner = {id = 0, procOrFamily = 0}, flags = 16, clientCount = 1, error = 0, recvBuf = {links = {prevPtr = 0x9c410, nextPtr = 0x9c410}, size = 0, maxSize = 4096}, sendBuf = {links = {prevPtr = 0x9c420, nextPtr = 0x9c420}, size = 0, maxSize = 4096}, local = {addrFamily = 0, port = 513, address = 2149619206, padding = {"\000\000\000\000\000\000\000\000"}}, remote = {addrFamily = 2, port = 1020, address = 2149615622, padding = {"\035\377\373P\000\000\373@"}}, sentTo = {addrFamily = 0, port = 0, address = 0, padding = {"\000\000\000\000\000\000\000\000"}}, linger = 0, parentPtr = 0x622b0, clientList = {prevPtr = 0x9c468, nextPtr = 0x9c468}, justEstablished = 0} (gdb) up #5 0xda3c in TCP_SocketClose (sockPtr=(Sock_InfoPtr) 0x9c3d8) (tcpSock.c line 171) 171 TCPCloseConnection(sockPtr, tcbPtr); (gdb) up #6 0x7eb4 in Sock_Close (privPtr=(struct Sock_PrivInfo *) 0x7b5d8) (sockOps.c line 409) 409 status = protoInfo[sharePtr->protoIndex].ops.close(sharePtr); (gdb) print *privPtr $4 = {links = {prevPtr = 0x9c468, nextPtr = 0x9c468}, sharePtr = 0x9c3d8, 
streamID = 31, fsFlags = 36867, pid = 139544, hostID = 33, userID = 0, clientID = -1, recvFlags = 0, recvFrom = {addrFamily = 0, port = 0, address = 0, padding = {"\000\000\000\000\000\000\000\000"}}, sendInfo = {flags = 0, addressValid = 0, address = {inet = {addrFamily = 0, port = 0, address = 0, padding = {"\000\000\000\000\000\000\000\000"}}}}, sendInfoValid = 0} (gdb) print *sharePtr $5 = {protoLinks = {prevPtr = 0x98d28, nextPtr = 0x98b30}, protoIndex = 2, protocol = 0, protoData = 0x9c330, reqBufSize = 4164, requestBuf = 0x87378 "\e", state = CONNECTED, options = 44, owner = {id = 0, procOrFamily = 0}, flags = 16, clientCount = 1, error = 0, recvBuf = {links = {prevPtr = 0x9c410, nextPtr = 0x9c410}, size = 0, maxSize = 4096}, sendBuf = {links = {prevPtr = 0x9c420, nextPtr = 0x9c420}, size = 0, maxSize = 4096}, local = {addrFamily = 0, port = 513, address = 2149619206, padding = {"\000\000\000\000\000\000\000\000"}}, remote = {addrFamily = 2, port = 1020, address = 2149615622, padding = {"\035\377\373P\000\000\373@"}}, sentTo = {addrFamily = 0, port = 0, address = 0, padding = {"\000\000\000\000\000\000\000\000"}}, linger = 0, parentPtr = 0x622b0, clientList = {prevPtr = 0x9c468, nextPtr = 0x9c468}, justEstablished = 0} (gdb) up #7 0x4fc8 in PdevRequestHandler (clientData=(ClientData) 0x7b5d8, streamID=31, eventMask=8) (main.c line 512) 512 (void) Sock_Close(privPtr); (gdb) up #8 0x144c4 in Fs_Dispatch () (gdb) up #9 0x4910 in main (argc=1, argv=(char **) 0x1dffff04) (main.c line 235) 235 Fs_Dispatch(); mike Log-Number: 30658 Date: Thu, 31 Jan 91 23:27:38 PST From: shirriff@ginger.Berkeley.EDU (Ken Shirriff) Subject: Allspice crash The disk seems to be failing on allspice, and this caused a crash: File blk 2259 phys blk 9036 Disk error UfsBlockRealloc: Bad Descriptor Block domain = 2, block = 9036 SCSI3#0 target 1 LUN0 media error info bytes 0x0 0x0 0x47 0xa7 Ofs_FileDescStore: couldn't write back desc Fslcl_DeleteFileDesc: couldn't mark desc as free Fatal 
Error: Fscache_FetchBlock hashing error. Log-Number: 30663 Date: Fri, 1 Feb 91 23:16:18 PST From: shirriff@dill (Ken Shirriff) Subject: Allspice crash Allspice crashed, apparently because mayhem was sending poison packets: a bunch of recovery with mayhem mayhem: Zero length parameter to reopen request Fatal Error: unaligned address trap in kernel I tried to debug allspice, but the mgbaker kernel on ginger doesn't seem to have symbols. I rebooted allspice, but it got to recovery with mayhem and then seemed to wedge, not getting to a login prompt. I tried to kill mayhem, but kmsg doesn't seem to exist on dill, and the king cluster doesn't recognize mayhem. I've rebooted allspice again, but it probably will go nowhere until someone can kill mayhem. Ken Log-Number: 30664 Subject: Re: Allspice crash Date: Sat, 02 Feb 91 13:02:07 PST From: Mary Baker <mgbaker> Allspice hit my mousetrap in Fsrmt_RpcReopen, for detecting when a reopen packet doesn't have the necessary parameters. After the test, it returns FAILURE immediately, so I have no idea what was unaligned. I somehow copied my stripped kernel to ginger rather than the unstripped. I just had to remove the installed source for 1.078 on ginger in order to make room for my unstripped kernel. It's there now if this happens again. (1.078 is the official "old" kernel, so probably nobody wants to look at that anyway.) Mary Log-Number: 30697 From: mendel (Mendel Rosenblum) Subject: Allspice crash Date: Mon, 11 Feb 91 11:36:53 PST Allspice crash again this morning. Same powercycle only problem. The messages before the crash were: Reinit receive unit. Reinit receive unit. Reinit receive unit. Dev_SyslogWrite: Buffer overflow ... Intel: Spurious interrupt (2) Intel: Spurious interrupt (2) Intel: Spurious interrupt (2) I rebooted allspice with the new kernel. It was running the old kernel. I started a debugger on shallot from dill with a breakpoint in panic(). 
If I'm not around when it crashes again someone should look at dill to see if it made it into the debugger. Also, it is possible that the crashes are related to the full dump I'm trying to do. The last four crashes have all been during the attempted dump of /user4. Mendel Log-Number: 30659 Date: Fri, 1 Feb 91 00:01:06 PST From: tve@ginger.Berkeley.EDU (Thorsten von Eicken) Subject: /tmp broken Allspice seems to have died a little while ago. Now it's back. However /tmp is not usable. Any attempt to create a file returns the error "file already exists". This already happened a week or two ago. TvE Log-Number: 30660 From: Fred Douglis <douglis@cs.vu.nl> Subject: two mail bugs Date: Fri, 01 Feb 91 09:54:24 +0100 I logged into sprite and noticed I had mail. This is bug 1, since mail is supposed to be forwarded. Perhaps it happened when assault was down or something and my home directory wasn't accessible. It would be so nice if Sprite could handle this situation more gracefully, somehow. Bug 2 is that when I tried to read my mail, every time I ran Mail I hit something like: /usr/tmp/Rx608569: file already exists I checked one time, and the file did not exist. /usr/tmp points to /tmp, and /tmp seems to be accessible and world-writable, so I don't see what the problem is. Fred Log-Number: 30661 Date: Fri, 1 Feb 91 11:22:39 PST From: ouster (John Ousterhout) Subject: Disk space? I'm getting lots of messages in my syslog like this: 2/1/91 11:21:34 allspice (14) RmtFile "/sprite/spool/mail/bmiller" <10,2223> Write-back failed: out of disk space<40008> but there appears to be lots of space available: Prefix Server KBytes Used Avail % Used / allspice 495968 409516 36855 91% Anyone have any ideas what's up? -John- Log-Number: 30662 Date: Fri, 1 Feb 91 11:25:28 PST From: ouster (John Ousterhout) Subject: Poisonous file? The file /sprite/spool/mail/bmiller appears to be poisonous: if I try to "ls -l" it, the ls hangs.
-John- Log-Number: 30665 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Sat, 2 Feb 1991 18:00:08 PST Subject: allspice crashed Allspice crashed this afternoon with another of those hashing errors. It looks like allspice got a bunch of hard errors trying to read descriptor blocks off the /user1 filesystem. I don't know if these two things are related but I took no chances and unmounted the disk (after copying its contents to /scratch3). The /user1 filesystem is on the disk that used to contain /scratch3. The old /user1 disk is not mounted. There is a core file on ginger in the standard place (I haven't looked at it). When allspice rebooted it didn't go through recovery with quite a few machines. I did an "rpcstat -trace" and all it showed were a bunch of echo requests that weren't being answered. Since allspice had been down for over an hour I rebooted it to see if that would fix the problem. The second time it seemed to recover fine. John Log-Number: 30668 Subject: Trace_Dump can scrawl over user memory Date: Sun, 03 Feb 91 21:20:19 PST From: Mike Kupfer <kupfer> Consider the code if (traceHdrPtr->recordArray[current].flags == TRACE_UNUSED) { numRecs = current; earlyRecs = current; lateRecs = 0; } else if (numRecs > lateRecs) { earlyRecs = numRecs - lateRecs; } from Trace_Dump. Suppose there are 135 records in the trace buffer, but the user only wants 10. Further, suppose we haven't wrapped around the buffer yet, so that the TRUE part of the if is taken. numRecs is now set to 135, and that's how many records the user will get. If he only allocated space for 10, well, the probability of Bad Things happening just went up by quite a bit. mike Log-Number: 30669 Subject: how to debug migration problems Date: Sun, 03 Feb 91 21:40:51 PST From: Mike Kupfer <kupfer> I was having problems compiling stuff on sage earlier today. The basic symptom was that a bunch of the .o files wouldn't get created, and there wouldn't be an error message anywhere. 
This sounded like some problem we had in the past where jobs would get migrated to some sick host and then die. The question is, how do I find out which host is causing the problems? I don't want to go around disabling migration on Sparcstations until I find the culprit. Instead, I want some way to track missing .o files back to the host that they were supposed to be compiled on. I spent a good long time messing with "migcmd -t" and "migcmd -d", but I didn't produce anything that looked remotely useful. mike Log-Number: 30670 From: Fred Douglis <douglis@cs.vu.nl> Subject: Re: how to debug migration problems Date: Mon, 04 Feb 91 10:12:49 +0100 For pmake, it's a bit easier because you can run "pmake -d jr" and it will tell you where it sends each job. In general, increasing the debugging level for migration ought to give you some more info, but will be harder to tie to particular files. Tracing migration hasn't been done in so long that I'm surprised you didn't crash a machine, let alone produce anything remotely useful :-). Fred Log-Number: 30671 From: Fred Douglis <douglis@cs.vu.nl> Subject: possible SCSI register bug Date: Mon, 04 Feb 91 13:57:55 +0100 Greg Sharp, a programmer here, found a bug in the Amoeba SCSI driver, which had apparently been derived from the SunOS SCSI driver. He then checked, and sure enough, Sprite apparently has the same bug. (He then made noises asking why we hadn't been sued yet; I suppose one could ask him the same thing.) Anyway, it seems that the routine "WaitPhase" checks CBSR_PHASE_BITS but doesn't check to see that the request signal is asserted. In the fine print in the SCSI documentation, however, it says those bits are only valid when request is asserted, and there is at least one device Greg has found that asserts something looking like "message in" for a short period of time before moving on to assert something else. It might be worth looking into. This bug report brought to you as a public service of the folks at VU... 
:-) Fred Log-Number: 30672 Date: Mon, 4 Feb 91 10:08:45 PST From: johnw (John Wawrzynek) Subject: X server The X server is not running correctly on gluttony. I start it up with the xinit command and nothing happens. Thanks. -JohnW Log-Number: 30673 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Mon, 4 Feb 1991 11:04:39 PST Subject: Xserver flakiness X won't start up if the name server (ginger) is not working, as is the case at the moment. We are looking into the problem. John Log-Number: 30674 Subject: df broken on LFS Date: Mon, 04 Feb 91 13:16:48 PST From: Mike Kupfer <kupfer> df says there are 148892 blocks available, but I keep getting "write-back failed: out of disk space" messages. This is not a big deal as long as LFS is "experimental", but I think it should be fixed before we install LFS as a "production" system. mike Log-Number: 30675 Subject: allspice crash Date: Mon, 04 Feb 91 13:42:03 PST From: Mary Baker <mgbaker> Allspice died with the error "F". It could not be debugged or watchdog reset. It was powercycled instead. Mary Log-Number: 30685 From: mendel (Mendel Rosenblum) Subject: allspice crash Date: Wed, 06 Feb 91 20:46:00 PST Allspice was in the debugger when I came back from dinner with a "HandleRelease: file not locked" error message. The core file is in /home/ginger/raid/cores/raid/allspice.crash.2-6 if anyone wants to look. I rebooted it. Mendel ps. After allspice rebooted jaywalk couldn't find any commands, libraries, or X11 font directories. I tracked the problem down to allspice's route table not having a machine type (i.e., "sun4") for jaywalk. My guess is that it must have hit the bug that caused netroute.new to not install routes correctly. I reran netroute.new and everything started working. Log-Number: 30676 From: mendel (Mendel Rosenblum) Subject: ds5000 considered harmful Date: Mon, 04 Feb 91 18:26:04 PST Forgery hung allspice in an infinite loop by dropping consist messages. I "kmsg -d" forgery and the problem cleared up.
The syslog messages looked like: Client 43 dropped 30 write-back & invalidate requests for "lw477-log" <10,92751> Client 43 dropped 30 write-back & invalidate requests for "lw477-log" <10,92751> Client 43 dropped 30 write-back & invalidate requests for "lw477-log" <10,92751> Client 43 dropped 30 write-back & invalidate requests for "lw477-log" <10,92751> Client 43 dropped 30 write-back requests for "lock" <10,92752> Client 43 dropped 30 write-back & invalidate requests for "lw477-log" <10,92751> Client 43 dropped 30 write-back requests for "lock" <10,92752> Client 43 dropped 30 write-back & invalidate requests for "lw477-log" <10,92751> Client 43 dropped 30 write-back requests for "lock" <10,92752> Client 43 dropped 30 write-back & invalidate requests for "lw477-log" <10,92751> Client 43 dropped 30 write-back requests for "lock" <10,92752> Client 43 dropped 30 write-back & invalidate requests for "lw477-log" <10,92751> Client 43 dropped 30 write-back requests for "lock" <10,92752> Client 43 dropped 30 write-back & invalidate requests for "lw477-log" <10,92751> Client 43 dropped 30 write-back requests for "lock" <10,92752> Client 43 dropped 30 write-back requests for "lock" <10,92752> Client 43 dropped 30 write-back requests for "lock" <10,92752> Client 43 dropped 30 write-back requests for "lock" <10,92752> Client 43 dropped 30 write-back requests for "lock" <10,92752> Client 43 dropped 30 write-back requests for "lock" <10,92752> Client 43 dropped 30 write-back requests for "lock" <10,92752> Client 43 dropped 30 write-back requests for "lock" <10,92752> Client 43 dropped 30 write-back requests for "lock" <10,92752> Mendel Log-Number: 30677 From: mendel (Mendel Rosenblum) Subject: Allspice crashed last night Date: Tue, 05 Feb 91 10:32:06 PST Allspice was down this morning when I came in. The problem appeared to be a corrupted directory /swap1/83/153 (83 == loiter). It seems that ds5000s are an evil machine. Fscheck repaired the damage when I rebooted it. 
While allspice was down joyride went into an infinite recovery loop with anise. I put joyride into the debugger. After allspice rebooted, forgery (another ds5000) went into an infinite recovery loop with allspice. I put forgery into the debugger. Mendel Log-Number: 30678 From: mendel (Mendel Rosenblum) Subject: Anise crash with disk full Date: Tue, 05 Feb 91 13:17:36 PST Anise panic'ed inside of LFS today with the code trying to update the attributes of an unallocated descriptor. The kgcore program from allspice would not finish. It kept hanging partway through the dump. The message before the crash indicated that the disk was full and file allocates and writes were failing. My guess is that the crash had something to do with this. Note that currently the LFS file systems will stop allocation when the disks are 75% full. This is why you get disk full messages when the disk is 75% full. Mendel Log-Number: 30679 Date: Tue, 5 Feb 91 13:38:33 PST From: shirriff (Ken Shirriff) Subject: Sendmail bug fixed I figured out why we got those mailer daemon messages about too many hops, with the multiple messages to allspice from sprite. The problem was that mail to foo.bar@sprite would cause a loop. This was because sendmail was doing: foo.bar@sprite is name resolved to foo.bar@allspice.Berkeley.EDU. This doesn't match our name (sprite.Berkeley.EDU) so we'll send it on. Thus, there was a loop. The solution was to change sendmail.cf to check against the local machine before and after name resolution. If anything strange happens with mail, let me know. Ken Log-Number: 30680 Subject: lint in pmax X server causes crash Date: Tue, 05 Feb 91 18:01:11 PST From: Mike Kupfer <kupfer> The RAID guys are still having problems with their CAD software crashing the X server on DS3100s. I've tracked it down to the line (*pmPointer->processInputProc) (&motion, pmPointer); in mfbpmax_io.c:pmSetCursorPosition(). There are two problems with this line.
The first is that the function being called expects 3 arguments, not 2. The second is a type clash at the second argument. The first problem was fixed in the very first R4 patch. Examination of our sources shows that only some of the changes from fix-1 are in our source tree. Either the patch was only partially applied, or some of the changes were later backed out. This makes me nervous--what other patches are there that we think we've incorporated but haven't really? I will post a query to xpert about the second problem. mike Log-Number: 30681 Date: Wed, 6 Feb 91 11:47:40 PST From: elm (ethan miller) Subject: problems with rn on sun4c For some reason, rn on my sparcstation refuses to recognize certain groups (the one I've noticed is soc.net-people). It claims they are bogus. rn on the ds3100 doesn't do this; nor does xrn on the sparc. ethan Log-Number: 30700 Subject: Re: problems with rn on sun4c Date: Mon, 11 Feb 91 12:57:22 PST From: Mike Kupfer <kupfer> The reason rn works on Decstations is that somebody installed a new version of rn, bringing it up to patchlevel 50. However, the new version was not installed on the Suns :-(. I tried recompiling for the Sparcstation, but there are compilation problems. I'll look into getting rn to compile, but I can't guarantee when I'll get to it. mike -- To: Mike Kupfer <kupfer@sprite.Berkeley.EDU> Subject: Re: bogus newsgroups on agate Date: Wed, 07 Nov 90 18:36:35 PST >From: rob@violet.berkeley.edu those are valid group. you might check the version of rn your using. do a control V inside of it. if it isn't patch level 47, the problem is a know bug with older versions of rn, that comes out when the number of groups agate subscribes to is over 1024. get your system administrator to install the latest copy of rn if this is the case. 
rob Return-Path: kupfer@sprite.Berkeley.EDU Received: from sage.Berkeley.EDU by violet.berkeley.edu (5.61/1.32 (TEMP)) id AA22493; Fri, 2 Nov 90 16:31:08 PST Received: by sprite.Berkeley.EDU (5.59/1.29) id AA729397; Fri, 2 Nov 90 16:31:11 PST Message-Id: <9011030031.AA729397@sprite.Berkeley.EDU> To: rob@violet.berkeley.edu Subject: bogus newsgroups on agate Date: Fri, 02 Nov 90 16:31:09 PST From: Mike Kupfer <kupfer@sprite.Berkeley.EDU> When I run rn, it asks me if I want to add a bunch of newsgroups. However, for quite a few of them, if I say "yes", it then says that they're bogus groups. Some of the groups are: fj.guide.admin fj.jus alt.books.technical mike Log-Number: 30684 From: mendel (Mendel Rosenblum) Subject: Re: Migration problem Date: Wed, 06 Feb 91 17:56:41 PST > Return-Path: shirriff > Received: by sprite.Berkeley.EDU (5.59/1.29) > id AA731687; Wed, 6 Feb 91 17:52:02 PST > Date: Wed, 6 Feb 91 17:52:02 PST > From: shirriff (Ken Shirriff) > Message-Id: <9102070152.AA731687@sprite.Berkeley.EDU> > To: bugs > Subject: Migration problem > > I'm running my simulator and using 60% of the CPU. However, if I leave > my machine idle, and then touch it, I get "Eviced 4 processes." So how > come Garth's processes are migrating onto my machine, even though I'm > using it heavily? > > Ken Doesn't the migration key off the load average (the average number of running processes) rather than the CPU utilization? If you are using only 60% of the CPU then your load average must be less than 1.0 so migrations might be accepted. Are you trying to say you don't want to share that last 40% with garth? Mendel Log-Number: 30688 From: Fred Douglis <douglis@cs.vu.nl> Subject: Re: Migration problem Date: Thu, 07 Feb 91 10:18:27 +0100 At some point, while I was working on the problem with machines floating up to a load of 1.0 for no reason, I changed the thresholds a bit. It's easier for things to wander onto a machine that has a steady load. 
Feel free to play with the thresholds (defined in the migd sources, but ultimately it would be nice to get them from a file or something). I never did track down why the loads got out of kilter, even with a fair amount of debugging info. If anyone ever finds out, please let me know. Fred Log-Number: 30686 Date: Wed, 6 Feb 91 23:18:20 PST From: msilva (Mario J. Silva) Subject: strange things are happening... I have a .forward file on sprite, so it was surprising to find after login a "you have mail message". I checked the mailbox and got a "no mail". I didn't care for a few days, until I decided to investigate 10 minutes ago. In /usr/spool/mail/msilva, I found a file with 7036 bytes containing what seems to me a log from migd with about 4k and, appended to that, a mail message directed to me. Mario. This is how the log looked like: Global_Init - process f2c4e version 5 on host mustard.Berkeley.EDU: run at Fri Feb 1 08:37:12 1991 run at Fri Feb 1 08:37:12 1991 run at Fri Feb 1 08:37:12 1991 run at Fri Feb 1 08:36:57 1991 run at Fri Feb 1 08:37:12 1991 run at Fri Feb 1 08:37:12 1991 2443e: MigPdev_OpenMaster: couldn't open "/sprite/admin/migd/pdev" (text file or pseudo-device busy) 32154: MigPdev_OpenMaster: couldn't open "/sprite/admin/migd/pdev" (text file or pseudo-device busy) 4930: MigPdev_OpenMaster: couldn't open "/sprite/admin/migd/pdev" (text file or pseudo-device busy) d126d: MigPdev_OpenMaster: couldn't open "/sprite/admin/migd/pdev" (text file or pseudo-device busy) 5484c: MigPdev_OpenMaster: couldn't open "/sprite/admin/migd/pdev" (text file or pseudo-device busy) 62a5f: MigPdev_OpenMaster: couldn't open "/sprite/admin/migd/pdev" (text file or pseudo-device busy) 32155: MigPdev_OpenMaster: couldn't open "/sprite/admin/migd/pdev" (text file or pseudo-device busy) 3938: MigPdev_OpenMaster: couldn't open "/sprite/admin/migd/pdev" (text file or pseudo-device busy) Global_Init - process 71937 version 5 on host assault.Berkeley.EDU: Global_Init - process 74f34 
version 5 on host garlic.Berkeley.EDU:
d3216: MigPdev_OpenMaster: couldn't open "/sprite/admin/migd/pdev" (text file or pseudo-device busy)
f2c4e: MigPdev_OpenMaster: couldn't open "/sprite/admin/migd/pdev" (text file or pseudo-device busy)
run at Fri Feb 1 08:37:06 1991

Log-Number: 30687
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Wed, 6 Feb 1991 23:21:16 PST
Subject: Re: strange things are happening...

Our root disk got a little messed up a few days ago and you are seeing the result. Just delete your spool file and mail should work better. Sorry for the screwup.

John

Log-Number: 30690
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Thu, 7 Feb 1991 17:20:07 PST
Subject: ds5000/200 cache weirdness

I've run into a strange problem while getting the X server to work properly on the ds5000. As Ed suggested I looked at whether or not I was handling the cache correctly and I discovered that the shared data structures (shared between kernel and X server) were cached for the kernel, and uncached for the X server. I changed it so that they were both cached and I began to see strange behavior. Here are the details.

The X server and the kernel share a queue of structures that are 3 words long. The x,y coordinates of the mouse are stored as two shorts in the first word of the structure. The cache line size is 4 words. This means that if an element is aligned on a cache line boundary, the last word of the cache line contains the x,y coordinates for the next element. This will happen to every 4th element. The strange behavior is that the X server will not see the writes to the x,y coordinates of the second element in a cache line. In particular, it appears as if the write by the kernel bypasses the cache and goes directly to memory. When the X server reads the element it already has the line in the cache so it reads the old x,y values. I verified that the values are indeed the values from the last time that the element was used.
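The layout arithmetic described above can be sketched as follows. This is only an illustration: it assumes the queue itself starts on a cache line boundary, and the names (ELEM_WORDS, LINE_WORDS, first_word, line_of, xy_is_last_word_of_line) are hypothetical, not taken from the Sprite or X server sources. Only the sizes (3-word elements, 4-word cache lines) come from the message.

```c
#include <stddef.h>

/* Sizes taken from the message; everything else here is illustrative. */
#define ELEM_WORDS 3u   /* each queue element is 3 words long */
#define LINE_WORDS 4u   /* each cache line is 4 words long */

/* Word offset of element i's first word (the word holding x,y),
 * assuming the queue starts on a cache line boundary. */
static size_t first_word(size_t i)
{
    return i * ELEM_WORDS;
}

/* Index of the cache line that contains the given word offset. */
static size_t line_of(size_t word)
{
    return word / LINE_WORDS;
}

/* Nonzero if element i's x,y word is the last word of a cache line,
 * i.e. it shares that line with the whole of the previous element. */
static int xy_is_last_word_of_line(size_t i)
{
    return first_word(i) % LINE_WORDS == LINE_WORDS - 1u;
}
```

Under these assumptions the condition holds for elements 1, 5, 9, and so on, which matches the "every 4th element" observation: element 0 fills words 0 to 2 of line 0, and word 3 of that line is the x,y word of element 1.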
The x,y values are 2 bytes long, so writing them involves a read-modify-write operation on the memory location. The correct behavior is seen if you take one of these actions: 1) make the data structures uncacheable, 2) flush the cache line before writing the values, 3) flush the cache line after writing the values, 3) read the values before writing them, and 4) read the values after writing them. Currently I am using approach #1, although 3 and 4 would be easy to implement.

Does anyone have any ideas on what is going wrong here, or which solution is preferable? I can't figure out why it makes a difference if the preceding read of the cache line was done by the X server or by the kernel since the cache is physical, but it appears to be true.

John

Log-Number: 30694
From: mendel (Mendel Rosenblum)
Subject: Allspice crashes somemore
Date: Sun, 10 Feb 91 21:08:25 PST

This message is to record three more crashes of allspice, all of the no-debug-must-power-cycle type. This is the fifth crash of this type in the last 24 hours (Feb 9 22:05, Feb 10 12:40, Feb 10 16:14, Feb 10 19:30, Feb 10 20:30). All but the 16:15 crash produced no interesting messages. The 16:15 crash at least produced a message:

    MachPageFault: Current process is NIL!! Trap pc is 0xf60a6670, addr 0xfffc8000.

The pc is in the byte swap code in the RPC module. JohnH and I think that a garbage packet caused this crash. There is (was?) no validation of the length field of the RPC header. If it was too large it could cause the byte swap code to run off the end of the net receive buffer. The first non-valid address after the net receive buffers is 0xfffc8000. JohnH put a patch in the uninstalled rpc module to check for this.

After the reboot from the 160:30 crash, allspice hung up during recovery. I pulled the network interface and still nothing happened. I typed a l1-t to check the callback queue and sure enough the callbacks weren't being processed.
Time was going forward and the current time was a couple of minutes past the times in the callback queue. It was like the callback interrupt wasn't being processed. I typed l1-a and continued the machine and all the callbacks were processed. Weird!

After the 20:30 crash I backed out to the sprite 1.079 kernel. The only crashes of a similar type were when the Jaguar board would get a bus error while processing an interrupt. No message was produced and the machine had to be power-cycled. My guess is that allspice is somehow getting a bus error in an interrupt handler. Maybe it is running into some problem in the net module. With the jaguar problem I was able to set a break point with the debugger in the panic() routine and it would hit the break point correctly. If this problem happens again I'll give this a try.

Mendel

Log-Number: 30695
Subject: 2 more reboots
Date: Sun, 10 Feb 91 22:55:36 PST
From: Mary Baker <mgbaker>

I just rebooted allspice twice again. Both times it got a -1 use count on a stream and the first time, at least, it was going through repeated recovery with crackle. I've put crackle in the debugger. Although allspice printed out that it entered the debugger, it timed out with kgcore, so I was unable to get core images for it. At least I was able to reboot it with a watchdog reset, rather than power cycling the horrid beast.

Mary

Log-Number: 30699
Subject: sage crash
Date: Mon, 11 Feb 91 12:56:38 PST
From: Mary Baker <mgbaker>

Sage died today in FsrmtFileClose with an exec ref count of -1. We've been seeing a lot of this recently.

Mary

Log-Number: 30701
Subject: vipw won't let me change some fields in master.passwd
Date: Mon, 11 Feb 91 14:25:09 PST
From: Mike Kupfer <kupfer>

A user asked me to change his password for him. After doing so, I noticed that "passwd" had changed a couple of fields in master.passwd from empty to "0". I wasn't sure if this was a problem, so I tried manually changing them back.
However, vipw would claim that I hadn't made any changes, so it wouldn't update master.passwd.

mike

Log-Number: 30702
From: mendel (Mendel Rosenblum)
Subject: Two more allspice crash
Date: Mon, 11 Feb 91 21:10:52 PST

Allspice hung up two more times. It didn't hit the panic breakpoint I had set. We're in trouble.

Mendel

Log-Number: 30704
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Tue, 12 Feb 1991 17:15:01 PST
Subject: NIL ioHandlePtr

Two machines (espionage and arson) have crashed today with the assertion

    filePtr->ioHandlePtr != (Fs_HandleHeader *) NIL

failed. Perhaps we should bump up the priority of this bug before it gets too prevalent. From what I can determine from the code, the ioHandlePtr should never be NIL upon return from Fsio_DeencapStream, unless the status is not SUCCESS. Someone needs to take a better look.

John

Log-Number: 30708
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Wed, 13 Feb 1991 14:17:23 PST
Subject: netEther.h

The file netEther.h contains constants defining the minimum and maximum sizes of ethernet packets. From reading the Lance manual it seems that the minimum size is 64, and the maximum is 1518. netEther.h lists the minimum as 60 (I just changed it to 64) and the maximum as 1514. Is there any reason why these are 4 bytes off?

John

Log-Number: 30710
From: mendel (Mendel Rosenblum)
Subject: Re: Mail broken
Date: Thu, 14 Feb 91 12:16:10 PST

> Return-Path: ouster
> Received: by sprite.Berkeley.EDU (5.59/1.29)
> id AA543527; Thu, 14 Feb 91 12:01:52 PST
> Date: Thu, 14 Feb 91 12:01:52 PST
> From: ouster (John Ousterhout)
> Message-Id: <9102142001.AA543527@sprite.Berkeley.EDU>
> To: bugs
> Subject: Mail broken
>
> I don't think that mail is getting into Allspice right now (I tried
> sending myself a message at Ginger and it didn't get through). Can
> someone who knows about such things check out sendmail on Allspice?
>
> -John-

The ipServer appeared to be messed up. It could talk out but not receive connections.
I ran restartIPserver and everything appears to work.

Mendel

Log-Number: 30711
Date: Fri, 15 Feb 91 12:50:41 PST
From: ouster (John Ousterhout)
Subject: Allspice crash

Whoops, I forgot to report this. When I came in this morning, Allspice was hanging RPCs from tyranny, but the RPCs weren't timing out. I took a look at Allspice, and its process table was full: lots of sendmail, mail, ftpd, and other processes. I killed off a few sendmails to get back some processes, and the process table situation seemed to improve, but tyranny still wasn't getting any response from Allspice, and when I tried "rpccmd -ping tyranny" from Allspice, Allspice never finished the command and I couldn't control-C or control-Z out of the command. At that point I decided things were just too weird, so I sync-ed the disks and rebooted.

-John-

Log-Number: 30712
From: mendel (Mendel Rosenblum)
Subject: Assault crash hoses allspice
Date: Mon, 18 Feb 91 09:58:42 PST

When assault crashes allspice also suffers. The problem appears to be that users with home directories on assault get mail and finger requests. The sendmail and finger daemons that get forked to process the requests hang until assault is rebooted. When I came in this morning allspice's process table was filled with sendmail and finger processes. Once the process table fills it is hard to do anything with the machine. I was able to kill off some of the processes remotely from raid1 and get allspice usable again.

Mendel

Log-Number: 30713
Date: Mon, 18 Feb 91 10:07:44 PST
From: shirriff (Ken Shirriff)
Subject: Re: Assault crash hoses allspice

Assault crashed and failed to enter the debugger, so I had to reboot it. The console was repeatedly printing:

    ICMP echo Address error in load: Address 17 PC 800a33a0 Entering debugger with TLB load addr error

Log-Number: 30714
Subject: ftp gets confused about transfer mode
Date: Mon, 18 Feb 91 14:21:34 PST
From: Mike Kupfer <kupfer>

I got tripped up trying to retrieve a binary file from prep.
Here's the sequence of events:

    (login)
    ftp> cd pub
    ftp> ls
    ftp> cd gnu
    ftp> ls

The last thing ftp tells me is

    226 Transfer complete.

My next command is

    ftp> binary

to which the response should be

    200 Type set to I.

Instead, I get

    Using ascii mode to transfer files.

And in fact the mode was still Ascii; when I fetched a .Z file, it got corrupted.

I tried ftp'ing from okeeffe to prep and could not reproduce the bug. I tried ftp'ing from sage to arpa and could reproduce the bug. Therefore, it looks like the bug is in our ftp client.

The "ls" seems to be necessary to reproduce the bug. If I do "binary" as my first command, it works correctly.

This bug might be related to the random error messages I get before I'm even able to type an ftp command, as in

    sage% ftp arpa
    Connected to ucbarpa.Berkeley.EDU.
    220 ucbarpa.Berkeley.EDU FTP server (Version 5.47 Sun Aug 6 07:56:21 GMT 1989) ready.
    Name (arpa:kupfer): anonymous
    331 Guest login ok, send ident as password.
    Password:
    230 Guest login ok, access restrictions apply.
    Remote system type is UNIX.
    usage: type [ ascii | binary | image | ebcdic | tenex ]
    Using ascii mode to transfer files.
    ftp>

mike

Log-Number: 30715
Subject: rcp can create bogus hard links
Date: Mon, 18 Feb 91 16:07:17 PST
From: Mike Kupfer <kupfer>

I decided to copy all of my personal 1/4" archive tape onto an Exabyte tape for easier access. So I created /r1/kupfer and got Ron Choi to copy the tape into hermes:/pic2/tmp/kufper [sic]. Ken has a hermes account, so he cd'd to /r1/kupfer and did something like

    rcp -r hermes:/pic2/tmp/kufper .

There were some problems with unreadable files on hermes, so Ron chmod'd them and Ken did another, more selective, rcp to bring over the stragglers.
When I went to remove old files before making the new tape, I found the following situation:

    sage% cd /r1/kupfer
    sage% ls -i
    155216 Mail/        101691 emacs/    149032 orc/
    199360 News/        117953 etc/      117975 preferences
    118027 Todo         133432 include/   29664 src/
    194912 amoeba/      117952 kufper/   117960 termcap
     76131 amusements/   73808 lib/       12424 tests/
    103424 bin/         131696 mach/     196280 xerox/
    117560 civ/          84352 man/
    sage% ls -i kufper
    155216 Mail/        101691 emacs/    117975 preferences
    199360 News/        117953 etc/       29664 src/
    118027 Todo         133432 include/  117960 termcap
    194912 amoeba/       73808 lib/       12424 tests/
     76131 amusements/  131696 mach/     196280 xerox/
    103424 bin/          84352 man/
    117560 civ/         149032 orc/
    sage% file kufper
    kufper: directory
    sage% ls -di /r1/kupfer
    66392 /r1/kupfer/
    sage% ls -ld amoeba
    drwxr-xr-x 3 kupfer 512 Feb 18 12:30 amoeba/
    sage% ls -al amoeba
    total 105
    drwxr-xr-x 3 kupfer 512 Feb 18 12:30 ./
    drwxrwxr-x 19 kupfer 512 Feb 18 11:37 ../
    -rw-r--r-- 1 kupfer 61872 Feb 18 12:30 11.ps
    -rw-r--r-- 1 kupfer 35655 Feb 18 12:30 11a.ps

Basically it looks like everything in kufper got an extra link to it, put in /r1/kupfer.

mike

Log-Number: 30716
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Mon, 18 Feb 1991 22:48:44 PST
Subject: ds5000 network bug found

It looks like part of the ds5000 network bug has been found. The bug has been with us all along -- one of the minor changes I made to the ds5000 network module made the bug deadly. Here's what happens. When the network driver gets a receive interrupt it goes into a loop processing the receive buffers. If a buffer is marked as owned by the host the routine Net_Input is called. When Net_Input returns the buffer pointer is incremented and the driver goes on to the next buffer. Now suppose that allspice sends us one of those bogus ack packets. Rpc_Dispatch will reset the network interface. When Net_Input returns the driver blindly goes on to the next buffer, unaware that the reset happened.
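The receive loop described above, together with a reset-counter guard of the kind this message goes on to describe, might look roughly like the sketch below. It is a simplified model, not the ds5000 driver: NetState, net_reset, recv_interrupt, and the two handlers are hypothetical names, the ring has an assumed size of 4, and the input callback merely stands in for Net_Input (which may or may not trigger a reset).

```c
#include <assert.h>
#include <stddef.h>

#define NUM_RECV_BUFS 4   /* assumed ring size, for illustration only */

/* Hypothetical driver state; field names are not from the Sprite sources. */
typedef struct NetState {
    int ownedByHost[NUM_RECV_BUFS]; /* nonzero: buffer holds a received packet */
    size_t current;                 /* software's idea of the current buffer */
    unsigned numResets;             /* incremented on every interface reset */
} NetState;

/* Stand-in for Net_Input; a real handler may end up resetting the chip. */
typedef void (*InputProc)(NetState *s, size_t buf);

/* A reset restarts the ring at buffer 0 and bumps the reset counter. */
static void net_reset(NetState *s)
{
    size_t i;
    for (i = 0; i < NUM_RECV_BUFS; i++) {
        s->ownedByHost[i] = 0;
    }
    s->current = 0;
    s->numResets++;
}

/* Process received buffers; returns how many were passed to input(). */
static int recv_interrupt(NetState *s, InputProc input)
{
    int handled = 0;
    assert(s->current < NUM_RECV_BUFS);
    while (s->ownedByHost[s->current]) {
        unsigned resetsBefore = s->numResets;
        size_t buf = s->current;
        input(s, buf);
        handled++;
        if (s->numResets != resetsBefore) {
            /* The interface was reset during input(): the ring restarted,
             * so advancing the old index would desynchronize us. */
            break;
        }
        s->ownedByHost[buf] = 0;                 /* hand buffer back to the chip */
        s->current = (buf + 1) % NUM_RECV_BUFS;  /* go on to the next buffer */
    }
    return handled;
}

/* Example handlers: one ordinary, one that resets the interface. */
static void input_ok(NetState *s, size_t buf) { (void)s; (void)buf; }
static void input_resets(NetState *s, size_t buf) { (void)buf; net_reset(s); }
```

Without the counter check, the loop would advance past buffer 0 after a reset while the (modeled) hardware had already restarted at buffer 0, which is the disagreement the message describes.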
Now the software and hardware think different buffers are the current buffer. In the current Lance implementation (sun4c) the next receive interrupt will cause one of those "Bogus receive interrupt" messages that will reset the chip. This time the pointers get reset correctly. When I first did the ds5000 port I was getting a lot of those bogus interrupts so I removed the reset since it was annoying and I couldn't see the need for it. Now I know.

I changed the ds5000 driver so that it includes a counter of the number of resets. If this count changes during the call to Net_Input the driver breaks out of the loop.

This fix doesn't explain the poison packets. I'm unable to come up with a scenario in which the transmit pointers get out of sync. I'm inclined to believe that they may just be an artifact of resetting the network interface during packet transmission since they seem to happen when the network interface is reset a lot. I'll keep my eye on it anyway.

John

Log-Number: 30717
Date: Tue, 19 Feb 91 13:54:32 PST
From: shirriff (Ken Shirriff)
Subject: Corrupted cache?

The cache on sassafras apparently got corrupted, changing a byte in one file from 00 to 02. Doing a "fscmd -f" restored the proper value. I found this problem very disconcerting, because my simulator would die on reading the bad value when running on sassafras, but not on anything else.

While I'm discussing strange bugs, for the past two days violence has been dead in the morning with a blank screen when I come in. It apparently totally wedges up in the night.

Ken

Log-Number: 30718
Subject: decman's X started up over mine
Date: Tue, 19 Feb 91 17:24:40 PST
From: Mike Kupfer <kupfer>

There I was, happily working away on Sage, when all of a sudden the screen went gray, covered by the standard monochrome background pattern. Then strange things started appearing on my screen: somebody else's xterm... a mail icon... xphoon. They all looked to be owned by decman.
Does our Sparcstation driver code refuse concurrent access to the screen? Does xinit do the right thing if the server can't start up? Is there a bug in decman's X startup script? mike Log-Number: 30719 Date: Wed, 20 Feb 91 13:13:55 PST From: shirriff (Ken Shirriff) Subject: Violence is flaky Violence suddenly wedged up again as I was using it. It didn't respond to L1-D or L1-A, so I couldn't figure out the problem. Ken Log-Number: 30720 Subject: rarp confusion Date: Wed, 20 Feb 91 18:02:32 PST From: Mike Kupfer <kupfer> Bob Miller got kvetching so that we could give away subversion. He asked to keep subversion as his machine name, so I swapped the ethernet addresses for subversion and kvetching in /etc/spritehosts. However, when I booted the new subversion, it still thought it was kvetching. I killed and restarted the arpd on allspice and rebooted again. It still thought it was kvetching. I eventually killed the arpd again and ran it -v (verbose) from Bob's office so I could see what it was doing. Of course, this time the machine came up as subversion. mike P.S. This probably isn't relevant to the bug, but while I had "arpd -v" running in Bob and Terry's office, I noticed a couple lines of the form "{RARP,ARP} with unknown protocol type: 0x500". Log-Number: 30721 From: mendel (Mendel Rosenblum) Subject: Re: rarp confusion Date: Wed, 20 Feb 91 18:06:58 PST > P.S. This probably isn't relevant to the bug, but while I had > "arpd -v" running in Bob and Terry's office, I noticed a couple lines > of the form "{RARP,ARP} with unknown protocol type: 0x500". 0x500 is the Sprite IP protocol type. The messages are from Sprite machines doing ARPs and RARPs for resolving the Sprite id to ethernet address mapping. 
Mendel

Log-Number: 30724
Subject: more questions about code segment management
Date: Wed, 20 Feb 91 22:19:45 PST
From: Mike Kupfer <kupfer>

(1) In FindCode (vmSeg.c), if vm_NoStickySegments is TRUE, then we assume that there is no segment already associated with this file. Is there some other check that ensures that there isn't a process using the segment? (My understanding is that vm_NoStickySegments is a proscriptive flag, saying "don't cache unused code segments". Have I got that right?)

(2) If a process in FindCode doesn't find a segment for the given file, it marks the file--by setting the segment pointer to 0--to show that it is about to give the file a segment. If some other process wants a segment for the same file, it notices the 0 segment pointer and sleeps on codeSegCondition. However, the only wakeup call on codeSegCondition happens if the first process decides it's not going to give the file a segment after all. Doesn't this mean that if the segment *is* set up, the second process will sleep forever?

mike

Log-Number: 30725
From: mendel (Mendel Rosenblum)
Subject: Re: more questions about code segment management
Date: Thu, 21 Feb 91 10:09:48 PST

> (1) In FindCode (vmSeg.c), if vm_NoStickySegments is TRUE, then we
> assume that there is no segment already associated with this file. Is
> there some other check that ensures that there isn't a process using
> the segment? (My understanding is that vm_NoStickySegments is a
> proscriptive flag, saying "don't cache unused code segments". Have I
> got that right?)

Isn't there a saying like "Beware of the path not taken" or something like that? In Sprite this translates to: if code that is not normally executed looks like it won't work, it is probably because it won't. Since we always run with vm_NoStickySegments == FALSE, it would not surprise me if it didn't work correctly with it set to TRUE.
Anyway, you are right: from the code it looks like "vm_NoStickySegments" really means "vm_NoShareCodeSegments".

>
> (2) If a process in FindCode doesn't find a segment for the given
> file, it marks the file--by setting the segment pointer to 0--to show
> that it is about to give the file a segment. If some other process
> wants a segment for the same file, it notices the 0 segment pointer
> and sleeps on codeSegCondition. However, the only wakeup call on
> codeSegCondition happens if the first process decides it's not going
> to give the file a segment after all. Doesn't this mean that if the
> segment *is* set up, the second process will sleep forever?
>

Although it will not be awoken promptly, it will not wait forever. The codeSegCondition is broadcast on during a failed exec(). I believe that the csh causes lots of failed exec() calls while looking down the search path. Also, recovery broadcasts on all conditions.

Mendel

Log-Number: 30726
Subject: questions about Vm_MakeAccessible
Date: Thu, 21 Feb 91 12:01:20 PST
From: Mike Kupfer <kupfer>

My understanding of Vm_MakeAccessible is that it is used to verify that a particular part of a process's virtual address space is accessible. It then "locks" this region to ensure that it remains accessible until Vm_MakeUnAccessible is called.

Two questions:

(1) Is the above summary correct?

(2) Why doesn't Vm_MakeAccessible do anything with the accessType parameter that is passed to it?

mike

Log-Number: 30727
Subject: multiprocessor race condition in exec code?
Date: Thu, 21 Feb 91 16:38:07 PST
From: Mike Kupfer <kupfer>

If SetupVM (in procExec.c) finds a heap that doesn't end on a page boundary, it pages in the end of the heap and zeroes out the rest of the page. Is there any locking of the page there, or is there the (admittedly small) potential in a multiprocessor system for the page to get stolen between the Vm_PageIn and the bzero?
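For reference, the "zero out the rest of the page" step amounts to the arithmetic sketched below. This is not the SetupVM code: tail_bytes and zero_tail are hypothetical helpers, the page size is assumed to be a power of two, and the sketch deliberately says nothing about locking, which is the open question in the message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bytes from heapEnd up to the next page boundary; 0 if already aligned.
 * Assumes pageSize is a nonzero power of two. */
static size_t tail_bytes(size_t heapEnd, size_t pageSize)
{
    size_t off;
    assert(pageSize != 0 && (pageSize & (pageSize - 1)) == 0);
    off = heapEnd & (pageSize - 1);      /* offset of heapEnd within its page */
    return off == 0 ? 0 : pageSize - off;
}

/* Zero the part of the heap's last page that lies beyond heapEnd.
 * `page` points at the start of that (already paged-in) page. */
static void zero_tail(unsigned char *page, size_t heapEnd, size_t pageSize)
{
    size_t n = tail_bytes(heapEnd, pageSize);
    if (n != 0) {
        memset(page + (pageSize - n), 0, n);
    }
}
```

In a kernel, the window between paging the page in and zeroing it is exactly where the page would have to stay resident; the helper above simply assumes that it does.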
mike

Log-Number: 30728
Date: Thu, 21 Feb 91 16:52:46 PST
From: elm (ethan miller)
Subject: ^Z problem in rn on sparcStation

^Z no longer suspends rn on the sparcstation. I haven't checked ds3100 for this bug yet.

ethan

Log-Number: 30729
From: mendel (Mendel Rosenblum)
Subject: assault crashes with same problem as last time
Date: Fri, 22 Feb 91 11:36:38 PST

Assault crashed with the same problem as its last crash. The console was repeatedly printing:

    ICMP echo Address error in load: Address 17 PC 800a33a0 Entering debugger with TLB load addr error

I rebooted it.

Mendel

Log-Number: 30731
Subject: pdev.h versus pdev.new.h
Date: Sun, 24 Feb 91 16:52:43 PST
From: Mike Kupfer <kupfer>

In /sprite/lib/include/dev one finds pdev.h and pdev.new.h. Are both necessary, or can one of them be deleted?

mike

Log-Number: 30732
Date: Mon, 25 Feb 91 10:02:53 PST
From: tve (Thorsten von Eicken)
Subject: hit ^Z at the login prompt

... and the login program goes into SUSP state. Pretty hard to get out of if you can't login...

TvE

Log-Number: 30733
Subject: Mail delivery problems?
Date: Tue, 26 Feb 91 00:00:55 PST
From: Mary Baker <mgbaker>

Is this just happening to me? Twice tonight my mail icon has beeped and also csh has said I have new mail. I go to read it, and it's not there. If there really was some mail, this is disconcerting.

Mary

Log-Number: 30734
Date: Thu, 28 Feb 91 16:22:35 PST
From: elm (ethan miller)
Subject: problems booting raid2

For the last day or two, we've been unable to boot raid2, either with the kernel I'm writing or with a standard sun4 kernel. The symptoms are: (raid2 prints this stuff)

    MEMORY x bytes allocated for kernel
    <open> 2/28/91 15:37:24 noname (67) RPC timed-out open of "cmds/initsprite" waiting for recovery
    2/28/91 15:37:36 noname (67) - recovering handles
    2/28/91 15:37:37 noname (67) Recovery complete 2 handles reopened
    Fsprefix_OpenCheck waiting for recovery
    Fsprefix_OpenCheck ok

At this point, raid2 hangs.
It doesn't enter the debugger or do anything else; it just hangs. raid2 has passed powerup selftest, so it's not likely to be a hardware problem. The hardware configuration has not changed since we were last able to boot it.

This is a rather urgent bug; we can't work on Ultranet software if we can't boot raid2.

ethan

Log-Number: 30736
Subject: spritemon covers up scale
Date: Fri, 01 Mar 91 19:04:39 PST
From: Mike Kupfer <kupfer>

Unless the measured number is bursty, spritemon tends to obscure the horizontal scale lines, which means you can't really tell what the value is that you're getting (at least on a monochrome display). It's all one big black blob. Here are a couple of possible fixes:

Option #1: use some sort of XOR scheme, so that the scale lines will show up video-inverted.

Option #2: when the vertical lines get to the right hand edge, scroll back to the middle of the window, the way the regular xload (e.g., on ginger) does.

mike

Log-Number: 30737
Subject: pager program
Date: Fri, 01 Mar 91 23:06:51 PST
From: Mary Baker <mgbaker>

There's no man page for the pager program in /sprite/cmds. I'll write one. But where is the source for this program? It's not in /sprite/src/cmds...

Mary

Log-Number: 30740
Subject: problems booting sun3
Date: Sun, 03 Mar 91 17:49:07 PST
From: Mike Kupfer <kupfer>

Murder is taking an excessively long time to boot. From the etherfind dump, it looks like murder is dropping a lot of packets, causing a 2-second pause for each dropped packet. Is it possible that allspice is simply pushing the packets too fast?
mike -- (allspice sends block 0x5b) 602.48 558 udp allspice.Berkel murder.Berkeley 1042 1348 08 00 20 00 fa 48 08 00 20 00 05 6d 08 00 45 00 02 20 00 d5 00 00 1e 11 6d 93 80 20 96 1b 80 20 96 09 04 12 05 44 02 0c 9a d1 00 03 00 5b 00 0c 4a 80 66 0e 22 2a 00 04 4c 6b 18 00 00 0c 4a 80 67 0a 42 94 20 3c 00 07 00 07 60 68 42 83 42 80 4a aa 00 04 6f 5c 78 02 2d 44 ff a4 42 ae ff ac 28 03 d8 aa 00 08 2d 44 ff a8 28 03 d8 92 2d 44 ff b4 28 2a 00 04 b8 ab (murder acks 5b) 602.50 60 udp murder.Berkeley allspice.Berkel 1348 1042 08 00 20 00 05 6d 08 00 20 00 fa 48 08 00 45 00 00 20 00 00 00 00 ff 11 8f 67 80 20 96 09 80 20 96 1b 05 44 04 12 00 0c 00 00 00 04 00 5b 6e 33 2e 6d 64 2f 6b 75 70 66 65 72 00 6f (allspice sends block 5c) 602.50 558 udp allspice.Berkel murder.Berkeley 1042 1348 08 00 20 00 fa 48 08 00 20 00 05 6d 08 00 45 00 02 20 00 d6 00 00 1e 11 6d 92 80 20 96 1b 80 20 96 09 04 12 05 44 02 0c 27 02 00 03 00 5c 6e 6f 74 20 66 6f 75 6e 64 00 53 65 65 6b 20 65 72 72 6f 72 00 44 4d 41 20 74 69 6d 65 6f 75 74 20 65 72 72 6f 72 00 57 72 69 74 65 20 70 72 6f 74 65 63 74 65 64 00 43 6f 72 72 65 63 74 61 62 6c 65 20 64 61 74 61 20 63 68 (murder acks 5b after timing out) 606.50 60 udp murder.Berkeley allspice.Berkel 1348 1042 08 00 20 00 05 6d 08 00 20 00 fa 48 08 00 45 00 00 20 00 00 00 00 ff 11 8f 67 80 20 96 09 80 20 96 1b 05 44 04 12 00 0c 00 00 00 04 00 5b 6e 33 2e 6d 64 2f 6b 75 70 66 65 72 00 6f (allspice resends 5c) 606.50 558 udp allspice.Berkel murder.Berkeley 1042 1348 08 00 20 00 fa 48 08 00 20 00 05 6d 08 00 45 00 02 20 00 d7 00 00 1e 11 6d 91 80 20 96 1b 80 20 96 09 04 12 05 44 02 0c 27 02 00 03 00 5c 6e 6f 74 20 66 6f 75 6e 64 00 53 65 65 6b 20 65 72 72 6f 72 00 44 4d 41 20 74 69 6d 65 6f 75 74 20 65 72 72 6f 72 00 57 72 69 74 65 20 70 72 6f 74 65 63 74 65 64 00 43 6f 72 72 65 63 74 61 62 6c 65 20 64 61 74 61 20 63 68 (murder acks 5c) 606.52 60 udp murder.Berkeley allspice.Berkel 1348 1042 08 00 20 00 05 6d 08 00 20 00 fa 48 08 00 45 00 00 20 00 00 00 00 ff 
11 8f 67 80 20 96 09 80 20 96 1b 05 44 04 12 00 0c 00 00 00 04 00 5c 6e 33 2e 6d 64 2f 6b 75 70 66 65 72 00 6f

Log-Number: 30741
Subject: meaning of vmStat.minFSPages
Date: Sun, 03 Mar 91 18:35:34 PST
From: Mike Kupfer <kupfer>

I noticed that vmStat.minFSPages is always zero. This is because it is initialized to zero, and of course no cache size is ever less than zero. I fixed the VM code to initialize minFSPages to INT_MAX and hacked Vm_MapBlock so that it would update minFSPages as well as maxFSPages (so that minFSPages would always have a "truthful" value).

Well, if I had thought about it for another 30 seconds, I would have realized that this isn't particularly useful, either, since it means that minFSPages will always be 1. So, it seems like the most useful number would be obtained by only updating minFSPages in Vm_UnmapBlock. This leads to the question of what value minFSPages should have before Vm_UnmapBlock is ever called (i.e., right after booting). Should it be INT_MAX or 0? (If it's INT_MAX, I'll probably want to hack vmstat to understand that, which is fine with me, but I'd like to first get consensus on what the value should be.)

mike

Log-Number: 30744
From: mendel (Mendel Rosenblum)
Subject: Re: meaning of vmStat.minFSPages
Date: Mon, 04 Mar 91 10:06:07 PST

I believe that the memory occupied by the file cache is limited by min and max values in the file cache code. The relevant variables from fsStat.h are:

    /*
     * Cache size numbers.
     */
    unsigned int minCacheBlocks;    /* The minimum number of blocks that
                                     * can be in the cache. */
    unsigned int maxCacheBlocks;    /* The maximum number of blocks that
                                     * can be in the cache. */
    unsigned int maxNumBlocks;      /* The maximum number of blocks that
                                     * can ever be in the cache. */
    unsigned int numCacheBlocks;    /* The actual number of blocks that
                                     * are in the cache. */
    unsigned int numFreeBlocks;     /* The number of cache blocks that
                                     * aren't being used.
*/ Mendel Log-Number: 30743 Date: Mon, 4 Mar 91 09:45:43 PST From: root (The Sprite God) Subject: /pcs/tic broken? I have problems accessing /pcs/tic, commands just hang. It's in the prefix table, but can't get to it. Can someone please look into that? TvE Log-Number: 30745 From: mendel (Mendel Rosenblum) Subject: Test_PrintOut is a NOP on sun4 Date: Mon, 04 Mar 91 10:48:19 PST The routine Test_PrintOut() in the sys module has some problems. Besides violating the coding convention with its name (should start with Sys_), it was written in a very non-portable way. Because of this someone ifdef'ed it to just return on the sun4. This is unfortunate because initsprite uses Test_PrintOut() to report errors if it can't get /dev/console to work. I patched Test_PrintOut() to work on the sun4. Mendel ps. A short flame: Ifdef'ing a routine out for a single machine type without adding a warning printf can seriously waste someone's time. Commenting the act with "What are these routines for?" does little to improve the situation. Log-Number: 30746 Subject: unsigned times in Fscache_Block Date: Mon, 04 Mar 91 13:13:37 PST From: Mike Kupfer <kupfer> Is there some reason why timeDirtied and timeReferenced are unsigneds (in Fscache_Block)? Most time values are signed. Also, making timeReferenced unsigned leads to strange behavior if the VM bias is turned off. mike Log-Number: 30747 Date: Mon, 4 Mar 91 13:29:43 PST From: tve (Thorsten von Eicken) Subject: LFS disk full limit too low .. at least it seems to me. It says disk full on /pcs: Prefix Server KBytes Used Avail % Used /pcs anise 1013760 676355 236029 74% and there are more than 200 Megs unused. Seems like a bit too much, doesn't it? 
TvE

Log-Number: 30748
From: mendel (Mendel Rosenblum)
Subject: Re: problems booting raid2
Date: Mon, 04 Mar 91 13:35:32 PST

> Date: Thu, 28 Feb 91 16:22:35 PST
> From: elm (ethan miller)
> To: bugs@sprite.Berkeley.EDU
> Subject: problems booting raid2
>
> For the last day or two, we've been unable to boot raid2, either with
> the kernel I'm writing or with a standard sun4 kernel. The symptoms are:
> (raid2 prints this stuff)
>
> MEMORY x bytes allocated for kernel
> ...
> At this point, raid2 hangs. It doesn't enter the debugger or do
> anything else; it just hangs.

It appears that the machine is hanging up very early in the startup of initsprite. My guess is it is hanging trying to open /dev/console. This is probably related to a known bug in the serial line driver for the sun. (See bug numbers 01574, 01626, and 01627.) If you try to open a serial line that doesn't have the proper RS232 signals asserted it will hang the machine by going into an infinite interrupt loop. The problem is the serial line driver doesn't properly ack certain types of interrupts.

> raid2 has passed powerup selftest,
> so it's not likely to be a hardware problem.

If you believe this there is a nice red bridge near my apartment I can get you a really good deal on...

> The hardware
> configuration has not changed since we were last able to
> boot it.

This turns out to be false. Boards were switched. More importantly, the connectors hooking up the console to the serial line port were changed.

>
> This is a rather urgent bug; we can't work on Ultranet software
> if we can't boot raid2.
>

The easiest way to get raid2 back online is to restore the cable and terminal setup to the one that worked.

> ethan

Mendel

Log-Number: 30749
Date: Mon, 4 Mar 91 15:33:58 PST
From: bmiller (Bob Miller)
Subject: printer problem

It seems that our printer, lw533, is hung. lpq on my machine shows 4 jobs waiting. lpq on shallot, which drives the printer, shows no entries. HELP!!

Log-Number: 30750
Subject: Re: printer problem
Date: Mon, 04 Mar 91 15:42:31 PST
From: Mike Kupfer <kupfer>

I did "lpc restart lw533", which seems to have unwedged things. The daemon had been running on mayhem.

mike

Log-Number: 30751
Date: Mon, 4 Mar 91 17:12:51 PST
From: sethg (Seth Copen Goldstein)
Subject: my machine is being consumed by sprite

All day long response time has been pretty bad. this is pretty typical:

roar:/pcs/tic/tam/tlc/002-sethg-08> ps -au | sort +2 -3
USER     PID %CPU %MEM  SIZE   RSS STATE  TIME PR COMMAND
decman 75439  0.0  1.5   276   244 RWAIT  0:02    -csh
decman 9542d  0.0  1.3   244   216 DEBUG  0:00    bin-spriteds/tl2ncube ...
decman c542e  0.0  1.3   240   212 DEBUG  0:00    bin-spriteds/tl2ncube ...
root    5434  0.0  0.0   168     0 WAIT   0:00    login -h ...
root   1540b  0.0  ---   ---   --- EXIT   0:00    cmds/initsprite -b ...
root   15418  0.0  0.8   280   124 RWAIT  0:50    /sprite/daemons/migd -D ...
root   1541b  0.0  0.0   324     0 RWAIT  0:00    sendmail -bd
root   1541d  0.0  0.5   120    84 WAIT   0:24    /sprite/daemons/cron
root   1541f  0.0  0.0   212     0 RWAIT  0:01    /sprite/daemons/lpd
root   45412  0.0  0.0   172     0 WAIT   0:01    /sprite/cmds.$MACHINE/lo...
root   45417  0.0  0.6   176   104 WAIT   0:17    /local/cmds/getcounters ...
root   6540f  0.0  0.0   140     0 RWAIT  0:01    /sprite/daemons/inetd ...
root   75433  0.0  0.6   168    96 RWAIT  0:03    rlogind
sethg  25422  0.0  0.6   848   100 RWAIT  0:02    xterm -n xterm_bot ...
sethg  25427  0.0  1.0   260   156 RWAIT  0:02    csh
sethg  25438  0.0  0.6   844   100 RWAIT  0:00    /X11/R4/cmds.ds3100/xter...
sethg  2544d  0.0  1.0   760   172 RWAIT  0:01    xcal -geom 137x17+440+5
sethg  3543c  0.0  0.6   576   104 RWAIT  0:02    xbiff -geom 48x48+51-0
sethg  45436  0.0  0.4   112    64 WAIT   0:01    /emacs/cmds/loadst -n 60
sethg  55416  0.0  0.8  1016   136 RWAIT  0:02    xterm -n xterm_top ...
sethg  6541a  0.0  0.6   848   100 RWAIT  0:01    xterm -n xterm_top ...
sethg  6543f  0.0  1.0   260   156 WAIT   0:01    -csh
sethg  6544c  0.0  0.8   568   124 RWAIT  0:01    xclock -rv -update 60 ...
sethg  75410  0.0  0.6   844   100 RWAIT  0:01    xterm -title login ...
sethg  7543a  0.0  0.0   148     0 RWAIT  0:00    rlogin mammoth
sethg  75455  0.0  1.0   224   156 WAIT   0:00    /users/sethg/.xinitrc ...
sethg  85411  0.0  1.0   232   156 WAIT   0:00    /users/sethg/.xsetup -f ...
sethg  8541c  0.0  1.0   256   156 RWAIT  0:02    csh
sethg  9541e  0.0  0.9  1188   140 RWAIT  0:03    xterm -n xterm_bot ...
sethg  95420  0.0  1.3   796   220 RWAIT  0:02    xpostit -geom 64x10+476+0
sethg  95456  0.0  2.6   744   424 RWAIT  0:02    xconsole -unmapped
sethg  9545d  0.0  0.8  1012   132 RWAIT  0:01    xterm -title login ...
sethg  a5435  0.0  2.3  1068   384 RWAIT  0:00    emacs -geometry 80x65-8-8
sethg  a5451  0.0  0.0   168     0 WAIT   0:00    xinit
sethg  a545e  0.0  1.8   488   292 RWAIT  0:09    twm
sethg  b543b  0.0  0.0   148     0 RWAIT  0:00    rlogin mammoth
sethg  f544f  0.0  1.0   256   156 RWAIT  0:01    csh -i
sethg  65437  0.1  1.4  1016   224 RWAIT  0:01    /X11/R4/cmds.ds3100/xter...
sethg  35449  0.2  1.1   700   188 RWAIT  0:13    spritemon -geom ...
sethg  5542b  0.2  2.5  1068   408 RWAIT  0:12    emacs -geometry 80x65-8-8
sethg  4542c  0.3  1.4   260   228 WAIT   0:03    /sprite/cmds/csh -i
sethg   544a  0.4  0.9   552   140 RWAIT  0:10    xeyes -geom 48x48+206-0
sethg  95448  1.0  0.4   328    72 WAIT   0:00    sort +2 -3
sethg  45447  1.2  1.0   248   164 RUN    0:00    ps -au
sethg  a5452 11.0  6.0  1292   976 RWAIT  2:59    /X11/R4/cmds/Xmfbpmax :0
root   5540d 16.4  4.6   984   748 READY 25:12    /sprite/daemons/ipServer
sethg  95413  2.5  8.1  7740  1332 RWAIT  1:36    emacs -geometry 80x65-8-8
roar:/pcs/tic/tam/tlc/002-sethg-08>

It seems that root is getting a lot of the machine a lot of the time. What is wrong?

Log-Number: 30752
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Tue, 5 Mar 1991 10:21:46 PST
Subject: Re: my machine is being consumed by sprite

Offhand I don't notice anything peculiar with your "ps". If your response is bad perhaps there is something else wrong. Were there any messages in your syslog?

John

Log-Number: 30753
Subject: Re: my machine is being consumed by sprite
Date: Tue, 05 Mar 91 11:21:54 PST
From: Mike Kupfer <kupfer>

Allspice was sludgy at least part of the day (I noticed a lot of timed-out RPCs).
Ken killed a runaway rn on allspice and performance seemed to get a lot better.

mike

Log-Number: 30755
From: mendel (Mendel Rosenblum)
Subject: Re: printer crashes sun3 and sun4
Date: Tue, 05 Mar 91 12:32:28 PST

>
> Our printer regularly crashes the machine it's attached to. This
> happens about once a day.
> Currently lw44 is hooked up to boing and from time to time
> the syslog says "receiver overrun on /dev/serialB" and then
> everything is dead. The printer used to be hooked up to
> hoot (sun3) and the machine often hung (we never saw the
> syslog). HELP! this is a pain! Does this happen to you guys too?
> TvE

Can you give more details than "regularly crashes"? Does it go into the debugger? Do you think that it is the known problem with the serial line driver hanging in an infinite interrupt loop? You can test for this by typing L1-A and continuing the machine from the monitor. This should clear the problem.

The only time I've seen larceny (the machine driving lw477) hang up is when the serial line falls out of the printer. Of course, we only have LaserWriter 0s and not LaserWriter 2s.

Mendel

Log-Number: 30756
From: tve (Thorsten von Eicken)
Subject: Re: printer crashes sun3 and sun4
Date: Tue, 05 Mar 91 13:22:22 PST

The machine "stops". No message about entering the debugger, no other message than the receiver overrun in the syslog. L1-A followed by C didn't do any good the last time I tried; I will try again. The cable doesn't fall out of the printer. Back when the printer was on a sun3 and that crashed, L1-A worked but the keyboard didn't anymore and we had to use the reset switch.

TvE

Log-Number: 30757
Date: Tue, 5 Mar 91 15:00:50 PST
From: elm (ethan miller)
Subject: bug in tcsh

About one in three or four times I start a tcsh, it dies with the error message:

"MachPageFault: Bus error in user proc xxxxx, PC = 1e208, addr = 0 BR Reg 80"

It's always the same address.
I've gotten it to do this twice under gdb (using tcsh -i), but I can't get a backtrace or any other hint of why tcsh is dying this way. The problem only occurs on the sun4c; I have never seen it happen on the Sun-4 or any Decstation. Any ideas on how to track this bug down? It's been around for months through various OS releases. ethan Log-Number: 30758 Subject: Re: bug in tcsh Date: Tue, 05 Mar 91 15:08:38 PST From: Mary Baker <mgbaker> Why does it give you no backtrace? Are you using the tcsh in /attcmds/tcsh/sun4.md/tcsh for debugging? It still has the symbol table. Mary Log-Number: 30759 From: mendel (Mendel Rosenblum) Subject: Disk errors in /pcs/tic Date: Tue, 05 Mar 91 17:19:27 PST The disk containing /pcs/tic appears to be in trouble. It can no longer write several sectors that contain file descriptors. The errors are: SCSI Disk SII#0 Target 4 LUN 0 error: media error - info bytes 0x0 0x0 0x27 0x98 SCSI Disk SII#0 Target 4 LUN 0 error: media error - info bytes 0x0 0x0 0x27 0x9c SCSI Disk SII#0 Target 4 LUN 0 error: media error - info bytes 0x0 0x0 0x43 0xe0 SCSI Disk SII#0 Target 4 LUN 0 error: media error - info bytes 0x0 0x0 0x43 0xe1 SCSI Disk SII#0 Target 4 LUN 0 error: media error - info bytes 0x0 0x0 0x47 0x98 SCSI Disk SII#0 Target 4 LUN 0 error: media error - info bytes 0x0 0x0 0x47 0x99 I was able to read these sectors but the writes fail. Mendel Log-Number: 30763 Date: Wed, 6 Mar 91 14:39:47 PST From: shirriff (Ken Shirriff) Subject: Allspice crash Allspice seemed to wedge up in consistency action with raid1. I tried to figure out what it was waiting on, but found my L1-i function has a bug, which killed allspice, so I rebooted. Log-Number: 30764 Date: Thu, 7 Mar 91 09:07:35 PST From: ouster (John Ousterhout) Subject: Assault crash Assault was mostly catatonic this morning when I came in (responded to various L1- commands, but not to other operations at the keyboard or to network packets). I rebooted it. 
-John-

Log-Number: 30765
Subject: rpchist: should it work?
Date: Thu, 07 Mar 91 11:44:53 PST
From: Mike Kupfer <kupfer>

I fixed rpchist so that it would compile. However, when I tried running it (turn it on, read some mail, turn it off), it wouldn't dump any counts out. Does anyone know if it's supposed to work? (I notice it's not currently installed in /sprite/cmds.) If it doesn't work, should we move it out of /sprite/src/cmds?

mike

Log-Number: 30766
From: mendel (Mendel Rosenblum)
Subject: System-wide hangup - anise rebooted
Date: Fri, 08 Mar 91 12:01:18 PST

Anise was stricken with the bug that caused the callback queue to quit being processed. This caused anise to hang RPCs to it and run out of RPC servers because it uses callbacks to reclaim servers after use. A couple of the RPCs it hung were consistency callbacks from allspice. This caused RPCs to allspice to be hung. This made it hard to get any work done.

I halted and rebooted anise and everything became unwedged for a few seconds before allspice went into an infinite loop printing the message:

Client 80 dropped 30 write-back & invalidate requests for "userLog" <10,82857>

I "kmsg -d" boing and everything cleared up. Big fun.

Mendel

Log-Number: 30767
Subject: allspice didn't boot; boot error messages
Date: Fri, 08 Mar 91 20:42:42 PST
From: Mike Kupfer <kupfer>

Someone rebooted allspice just before we went to Raleighs. (I'll let them explain what had happened.) When I got back, allspice's console showed a bunch of "scsi disk busy" messages and the first couple lines that one normally sees at boot time, ending with "Machine type 0". The system wouldn't respond to the console, so I reset it and rebooted.

There were a few error messages that I noticed while allspice was rebooting.

(1) There were a bunch of complaints about attaching a local disk. These went flying off the console before I could read them or hit ^S.
I assume this has something to do with finding the root partition; someone please correct me if this isn't the case. (2) There were complaints about "route to <n> not installed" for a couple different values of n. "no more routes" and "out of free routes!!" also appeared. (3) There were complaints about failed write-backs due to a full disk. However, after I was able to logon to allspice I did a "df" and found no full partitions. (4) joyride went into a series of recoveries with allspice, so I killed it and rebooted it. mike Log-Number: 30769 Date: Sat, 9 Mar 91 12:35:20 PST From: shirriff (Ken Shirriff) Subject: Re: allspice didn't boot; Allspice died yesterday afternoon with: Fscache_write: DISK FULL Ofs_FileTrunc Abandoning (indirect) block Fscache_FetchBlock hashing error Since this is a known bug, I rebooted. Log-Number: 30770 Date: Sat, 9 Mar 91 14:07:33 PST From: shirriff (Ken Shirriff) Subject: Assault crash Assault was crashed when I came in with: ClientCommand: write back msg failed 40012 Fatal Error: MemFree storage block already free. It failed to enter the debugger, so I couldn't track down the problem. I rebooted, but it went into an infinite recovery loop with allspice. I rebooted again, and everything seems to be fine now. Log-Number: 30773 Date: Mon, 11 Mar 91 14:12:43 PST From: bmiller (Bob Miller) Subject: printer problem Our printer, lw533, seems to be hung again. SHALLOT, which drives the printer, shows 'no entries'...SUBVERSION shows 'waiting for queue to be enabled on shallot.' Log-Number: 30774 Subject: Re: printer problem Date: Mon, 11 Mar 91 14:18:08 PST From: Mike Kupfer <kupfer> I restarted the printer daemon for lw533, which was running on espionage. mike Log-Number: 30776 Date: Mon, 11 Mar 91 18:09:54 PST From: dedood (Paul de Dood) Subject: rlogin burble I can't rlogin into burble from other machines (such as gluttony & buzz). 
I can rlogin into other machines but I can't rlogin to burble from those machines or from the machine I'm on (chips.csl.sri.com). Does anyone know what is wrong, or how to rectify the situation?

Thanks,
Paul.

Log-Number: 30778
Subject: Re: rlogin burble
Date: Mon, 11 Mar 91 20:56:17 PST
From: Mike Kupfer <kupfer>

The IP Server on burble is probably wedged. Somebody needs to log in at the console and restart it.

mike

Log-Number: 30781
Date: Thu, 14 Mar 91 11:48:57 PST
From: tve (Thorsten von Eicken)
Subject: error at end of reboot

I just rebooted crackle (sun4c) and got "Initsprite script exited abnormally" (or so). I also have a "csh -i" owned by root keeping the cpu busy at 100%.

TvE

Log-Number: 30782
Subject: Re: error at end of reboot
Date: Thu, 14 Mar 91 12:03:35 PST
From: Mike Kupfer <kupfer>

The last thing executed in /hosts/crackle/bootcmds is "vmcmd -F -300". A quick check of vmcmd.c shows that the exit status is never set (main() doesn't return or call exit()).

Does anyone know why none of the Vm_Cmd invocations in vmcmd check the return status? Is vmcmd always supposed to exit with a status of 0?

I don't know what caused the looping csh.

mike

Log-Number: 30787
Date: Fri, 15 Mar 91 11:14:37 PST
From: tve (Thorsten von Eicken)
Subject: run-away csh on reboot problems remain

Was this supposed to be fixed yesterday? I just rebooted and again had a run-away csh -i. Boing had one too.

TvE

Log-Number: 30783
Date: Thu, 14 Mar 91 12:55:00 PST
From: tve (Thorsten von Eicken)
Subject: nfsmount problems

We got our 88k box up again and I'm trying to nfsmount the disk. This used to work but doesn't anymore. The nfsmount somehow dies immediately after start-up.
assault-8# ls -ls /rumble total 1 1 rrwxrwxrwx 1 root 11 Aug 16 1990 u1^ -> /rumble/u1 assault-9# nfsmount -t rumble:/u1 /rumble/u1 Attributes of rumble:/u1 FileID 2 FS_ID 5203 Type 2 mode 040775 links 17 size 272 RootID <-1,0,20995,2> assault-10# ps -ax | egrep nfs Unknown option "-x"; type "ps -help" for information c195c RWAIT 0:03 nfsmount woosh:/usr/ncube /usr/ncube 7195a RWAIT 0:01 nfsmount ginger:/var/spool/msgs /sprite/spool/msgs 81955 RWAIT 0:00 nfsmount ginger:/home/ginger/sprite /home/ginger/sprite 4193b RWAIT 0:00 /sprite/daemons/unfsd 81950 RWAIT 0:00 nfsmount ginger:/home/ginger/users /home/ginger/users a1957 RWAIT 0:00 nfsmount ginger:/home/ginger/spare /home/ginger/spare c195b RWAIT 0:00 nfsmount ic:/octtools /ic/octtools 1195d RWAIT 0:00 nfsmount woosh:/woosh/tic /woosh/tic 9195e RWAIT 0:00 nfsmount hermes:/a /postgres/a d1954 RWAIT 0:00 nfsmount ginger:/home/ginger/raid /home/ginger/raid 7194c RWAIT 0:00 egrep nfs assault-11# Log-Number: 30786 Date: Thu, 14 Mar 91 21:24:29 PST From: shirriff (Ken Shirriff) Subject: Allspice crashed Allspice crashed this evening with a use count = -1 error. I think the problem was due to a bug in the kernel I was running on sassafras, which messed up process migration. Ken Log-Number: 30788 Subject: Re: mail Date: Fri, 15 Mar 91 11:46:15 PST From: Mike Kupfer <kupfer> > Date: Fri, 15 Mar 91 10:56:01 PST > From: dfb (David F. Bacon) > To: root > Subject: mail > > twice in the last few days sprite has come up and sent me mail about files in > /lost+found, which turn out to be undelivered mail messages. is this a > transient bug, or should i not rely on mail sent from sprite hosts being > delivered? > > david Have you checked with the recipient to see whether the mail was actually delivered? I think it is possible that the files you are seeing are in fact just copies. Also, please send queries like this to "bugs". 
The same people will receive this message, but if you send mail to "bugs" it gets logged and we will be sure to discuss it at the weekly Sprite meeting. thanks, mike Log-Number: 30789 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Fri, 15 Mar 1991 12:50:15 PST Subject: Re: run-away csh on reboot problems remain This is a known bug that has been around for a while. If the last thing in your bootcmds exits with a non-zero status it puts the csh in an infinite loop. We will discuss it at the meeting today, but I'm not sure any of us is too eager to mess with the csh sources. John Log-Number: 30790 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Fri, 15 Mar 1991 18:16:42 PST Subject: nfsmount The nfsmount of ginger:/home/ginger/spare completely disappeared from assault. Didn't even go into the debugger. John Log-Number: 30791 Subject: "not found" messages doing ls in /pcs Date: Sat, 16 Mar 91 15:46:53 PST From: Mike Kupfer <kupfer> There were some error messages in the dump log of the form Dump: can't lstat /pcs/tic/tam/tlc/002-adam-01/check.c: invalid argument I tried looking at 002-adam-01. 
"echo *" listed the following files: #machine.h# N README README.const bin-sprite3 bin-sprite4 bin-spriteds check.c check.h copyright gen-c.c gen-c.h gen-mips.c gen-mips.c~ gen-mips.h gen-ncube.c gen-ncube.h instr.c instr.c~ instr.h instr.h~ machine.c machine.c.old machine.c~ machine.h machine.h~ machine.o mail main.c mkfile mkfile-sprite3 mkfile-sprite3~ mkfile-sprite4 mkfile-spriteds mkfile~ opcode-alu.h opcode-check.h opcode-gram.h opcode-sym.h opcode-tok.h opcode.awk opcode.src opcode.txt parser-gen.y parser.c parser.h parser.tab.h parser.y scanner.c scanner.h symtab.c symtab.h th tree.c tree.h type.c type.h var.c var.h "ls" gave me check.c not found gen-ncube.h not found gen-ncube.c not found mail not found mkfile-spriteds not found README not found N not found gen-c.h not found gen-c.c not found README.const not found parser.c not found parser.tab.h not found mkfile-sprite3 not found mkfile-sprite3~ not found gen-mips.h not found gen-mips.c~ not found machine.h not found mkfile~ not found machine.h~ not found #machine.h# instr.c~ mkfile opcode.src symtab.h bin-sprite3/ instr.h mkfile-sprite4 opcode.txt th@ bin-sprite4/ instr.h~ opcode-alu.h parser-gen.y tree.c bin-spriteds/ machine.c opcode-check.h parser.h tree.h check.h machine.c.old opcode-gram.h parser.y type.c copyright machine.c~ opcode-sym.h scanner.c type.h gen-mips.c machine.o opcode-tok.h scanner.h var.c instr.c main.c opcode.awk symtab.c var.h Can someone explain to me what's going on here? thanks, mike Log-Number: 30794 From: mendel (Mendel Rosenblum) Subject: Re: "not found" messages doing ls in /pcs Date: Sun, 17 Mar 91 13:09:04 PST > Can someone explain to me what's going on here? > > thanks, > mike The disk is broken so the file system gets errors trying to update descriptor blocks. It looks like the dump program found a directory containing files whose descriptors couldn't be written because of disk errors. 
We might consider taking /pcs/tic offline because Sprite is not very robust in the face of disk errors. It might panic().

Mendel

Log-Number: 30798
Subject: Re: "not found" messages doing ls in /pcs
Date: Sun, 17 Mar 91 20:40:54 PST
From: Mike Kupfer <kupfer>

Actually, /pcs/tic is now on the /pcs partition. The former /pcs/tic partition is now /pcs/scratch. Does this mean that the /pcs partition is now suspicious?

mike

--
Date: Sat, 9 Mar 91 02:46:35 PST
>From: tve (Thorsten von Eicken)
To: sprite
Subject: disk /pcs/tic renamed to /pcs/scratch

Mendel advised me that /pcs/tic seems getting ready to shred the data. I renamed the disk to /pcs/scratch and moved important stuff off and temp stuff on.

I did "prefix -U /pcs/tic; prefix -M /dev/rsd04c /pcs/scratch", I fixed /hosts/assault/mount and /sprite/src/admin/{daily,weekly}dump. I couldn't figure out how to broadcast a "prefix -d /pcs/tic" to all machines (there now is a regular directory /pcs/tic), so I expect some confusion in the near future.

Did I miss anything?

TvE

Log-Number: 30799
Date: Sun, 17 Mar 91 21:55:26 PST
From: tve (Thorsten von Eicken)
Subject: Re: "not found" messages doing ls in /pcs

At the time I did "update -Oq /pcs/tic /pcs/tic.new" and then remounted /pcs/tic as /pcs/scratch and renamed /pcs/tic.new to /pcs/tic. So I guess update just moved the trash to the other disk. The directory which causes problems can be deleted, but I would appreciate it if someone knowledgeable could do it.

What should we do with /pcs/scratch? How 'bout reformatting the disk so that the id sectors get rewritten?

TvE

Log-Number: 30792
Date: Sat, 16 Mar 91 15:47:07 -0800
From: dfb@bastille.berkeley.edu (David F. Bacon)
Subject: mail

i did check with the recipients and the mail was not delivered.

david

Log-Number: 30793
Subject: raid1 reboot
Date: Sat, 16 Mar 91 22:48:20 PST
From: Mike Kupfer <kupfer>

I had to reboot raid1. It seemed to be in some sort of state where accesses to other servers would cause a consistency error.
The console was filled with complaints about being unable to write back the raid1 counters file because of a consistency error. mike Log-Number: 30796 Date: Sun, 17 Mar 91 17:20:59 PST From: mottsmth (Jim Mott-Smith) Subject: DS5000 Isn't Sprite supposed to know that a DS5000 can use DS3100 object code? When I say 'make' it apologizes and says that the ds5000 is not in the list of legal target machines. If I say 'make TM=ds3100' it's quite content to build a 3100 object file. -- Jim M-S Log-Number: 30797 Subject: Exabyte refused to load tape Date: Sun, 17 Mar 91 20:27:06 PST From: Mike Kupfer <kupfer> The Saturday night dump failed, with the following errors in allspice's syslog: Warning: Exabyte 8200 at SCSI3#2#2 Target 5 LUN 0 error: hardware error - info bytes 0x0 0x0 0x0 0xed Warning: Exabyte tape not present Warning: Exabyte Servo System error, catastrophic failure! Warning: Exabyte 8200 at SCSI3#2#2 Target 5 LUN 0 error: hardware error - info bytes 0x0 0x0 0x0 0xed Warning: Exabyte tape not present Warning: Exabyte Servo System error, catastrophic failure! When I checked the drive, the dump tape had been ejected. When I put the dump tape or any other tape back in, it was ejected after a couple seconds. Power cycling the drive seems to have fixed things up. mike Log-Number: 30800 Date: Sun, 17 Mar 91 23:32:58 PST From: elm (ethan miller) Subject: problems with mail? These two mail messages appeared as one message last night. I assume the "From" line on the second message got lost somehow, and there may be text missing after the "--" in the first message. ethan %From kupfer Sun Mar 17 20:41:01 1991 %To: mendel %Cc: bugs %Subject: Re: "not found" messages doing ls in /pcs %In-Reply-To: Your message of Sun, 17 Mar 91 13:09:04 -0800 %Date: Sun, 17 Mar 91 20:40:54 PST %From: Mike Kupfer <kupfer> %Actually, /pcs/tic is now on the /pcs partition. The former /pcs/tic %partition is now /pcs/scratch. Does this mean that the /pcs partition %is now suspicious? 
%mike %-- %Date: Sat, 9 Mar 91 02:46:35 PST %From: tve (Thorsten von Eicken) %To: sprite %Subject: disk /pcs/tic renamed to /pcs/scratch %Mendel advised me that /pcs/tic seems getting ready to shred the data. %I renamed the disk to /pcs/scratch and moved important stuff off and %temp stuff on. %I did "prefix -U /pcs/tic; prefix -M /dev/rsd04c /pcs/scratch", I fixed %/hosts/assault/mount and /sprite/src/admin/{daily,weekly}dump. I couldn't %figure out how to broadcast a "prefix -d /pcs/tic" to all machines (there %now is a regular directory /pcs/tic), so I expect some confusion in the %near future. %Did I miss anything? % TvE Log-Number: 30801 Date: Mon, 18 Mar 91 13:58:53 PST From: ouster (John Ousterhout) Subject: FTP connections refused Twice today FTP to allspice has wedged up so that attempts to connect result in the following message from FTP: ftp: connect: connection refused I restarted the IP server this morning, and I'm about to restart it again. For some reason, this bug doesn't seem to affect rlogins or mail. -John- Log-Number: 30802 Subject: Re: FTP connections refused Date: Mon, 18 Mar 91 14:12:35 PST From: Mike Kupfer <kupfer> Frequently when this happens, there is a line in the syslog like <28>Mar 18 11:26:18 inetd[90e58]: ftp/tcp accept: invalid argument A quick check of /sprite/syslogs/allspice.Berkeley.EDU also shows <18>Mar 18 12:31:39 sendmail[10e5a]: NOQUEUE: SYSERR: getrequests: accept: invalid argument <18>Mar 18 12:35:08 sendmail[10e5a]: NOQUEUE: SYSERR: getrequests: accept: invalid argument <18>Mar 18 12:38:30 sendmail[10e5a]: NOQUEUE: SYSERR: getrequests: accept: invalid argument <18>Mar 18 12:39:00 sendmail[10e5a]: NOQUEUE: SYSERR: getrequests: accept: invalid argument and sure enough, there is a bunch of mail on ginger queued up for allspice because allspice's sendmail wasn't talking to anyone. Perhaps the IP server is mismanaging the socket that the application is doing the accept() on. 
mike Log-Number: 30803 Subject: problems compiling Date: Mon, 18 Mar 91 16:46:23 PST From: Mary Baker <mgbaker> Frequently but randomly, I'm getting the error MemChunkAlloc couldn't extend heapcc: Program cpp got fatal signal 9. when trying to compile kernel sources. I am compiling on sparcstations using the "new" 1.084 kernel. I didn't have this problem before with this kernel. Does anyone have a clue what's happening here? (I'm running a different kernel on my machine, but I turned off importing migrated processes to my machine.) Mary Log-Number: 30804 Date: Mon, 18 Mar 91 17:28:38 PST From: ouster (John Ousterhout) Subject: More compilation problems I too have been experiencing the same pmake problems that Mary reported, plus at least one other that I can't put my finger on. When I started typing "pmake -X" the problems seem to have stopped. I'm wondering if there's any chance that Ken's test kernel for UNIX compatibility is allowing migrations to itself, and if this might perhaps be the cause of the problems? As I remember, Bob had to change the migration version in his test kernels to prevent problems. -John- Log-Number: 30805 Date: Mon, 18 Mar 91 17:32:46 PST From: shirriff (Ken Shirriff) Subject: Re: More compilation problems Oops. Sorry. Sassafras had a bunch of messages about killing migrated processes, so I think my new kernel has problems with migration. I'll change the migration version for my kernel. I've rebooted sassafras with an old kernel, so things should work properly now. Ken Log-Number: 30806 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Mon, 18 Mar 1991 18:09:36 PST Subject: bugs fixed in vm I fixed two bugs in the vm system. First, the coreMap was not initialized properly during boot. On the ds5000 this caused some of the bits to be set to bogus values (perhaps the other machine types zero their memory before boot?). Lots of the pages had the "wireCount" field set to a random number, so that they could not be deleted from a segment. 
Second, the code that deleted pages from a segment did not work correctly if the "wireCount" was greater than zero. It would bail out without cleaning up properly, so that when the process tried to exit it would end up waiting forever on a condition.

John

Log-Number: 30807
Date: Tue, 19 Mar 91 09:13:52 PST
From: ouster (John Ousterhout)
Subject: Re: changes to rpc

In response to John's message:

    When I booted my ds5000 I got the following messages:

    Broadcasting for server of "/"
    RpcDoCall: <prefix> RPC to broadcast is hung
    <prefix> RPC ok

    I've never seen this before. Could this be related to the changes
    you made to the RPC system?

Right: this replaces the "x hanging my broadcast message", with the positive side-effect that the RPC system doesn't return an error to its caller, but keeps trying and very quickly succeeds. I believe that what's really going on is this:

1. A client sends a broadcast packet.
2. The server either doesn't get the packet, or takes a bit too long to service it.
3. The client retransmits its request, with the "please ack" bit set.
4. Every machine on the network gets the retransmitted broadcast and acks it.
5. The client gets a zillion acks.
6. The client thinks it's getting too many acks (i.e. the server has been taking too long to respond: the client doesn't distinguish between N acks spaced out a few seconds apart and N acks received in a 100-ms interval). For normal RPCs, it just prints a message like the one above and keeps trying. For broadcasts, the RPC system used to return an error. I changed it so that it treats broadcasts the same as other RPCs, and keeps trying.

I suppose the real solution is to change the client so that it realizes that it's gotten a zillion acks in a short interval and doesn't get upset. But this seems like a much more substantial change, so I didn't do it.
-John- Log-Number: 30809 Date: Tue, 19 Mar 91 16:10:51 PST From: elm (ethan miller) Subject: yet more printer problems I can't print a (certain) PostScript file to lw608-8. This file does print on lw508-5, which is run off Unix. When I try to print to lw608-8, it shows up in the print queue and then disappears some time later (with no paper showing up). As far as I know, the two printers are both the same model of laserwriter. The files in question are in /home/ginger/raid/viewlogic/raidII/hippi/*.ps. Does this have anything to do with some header that Unix might prepend and Sprite not prepend (or the other way around)? thanks ethan Log-Number: 30811 Subject: roar won't boot Date: Wed, 20 Mar 91 23:39:39 PST From: Mike Kupfer <kupfer> Seth Goldstein (x3-7566) tells me that roar won't boot. It gets to "starting RPC service" and then goes into an infinite loop of (approximately) FsrmtPseudoDeviceVerify, client 43 not known for lcl pdev <e000a, 78a75a60> I discovered that this is being generated by FspdevRmtPseudoStreamVerify, but I couldn't tell just from looking at the code what was supposed to be happening. I tried zapping all the pseudo-devices in /hosts/roar, but that didn't help. Seth said he's dead in the water until this is fixed (and I have to leave in a few minutes), so could someone look at this in the morning? Seth said he'd come in early to boot roar so that someone could debug it. mike Log-Number: 30812 Subject: problems with allspice reboot Date: Thu, 21 Mar 91 00:41:54 PST From: Mike Kupfer <kupfer> I took allspice down to reboot it because the dump job had hung Tuesday night/Wednesday morning and couldn't be killed. I ran into a few problems rebooting it. (1) There is nothing in the allspice "howto" sheet telling how to skip disk checks ("-f"). Should this information go in each server's howto, or is there a more general document that it should go in? (2) Allspice hung the first time it came up. 
The last message in the syslog said something about a consist RPC with coons hanging. (3) While I was trying to figure out what to do about coons and allspice, the following message appeared on allspice's console: OfsFragFree bitmap=<3f> checkMask=<c0> OfsFragFree: block not free, block 59930, numFrag 2, offset 0 Does this mean that despite my manually syncing allspice and then running "shutdown", the disk still wasn't brought down cleanly? (4) I tried putting coons in the debugger to see if that would unwedge allspice. (a) There's no ds3100 kmsg in ~sprite on dill. (b) The sun3 kmsg in ~sprite is almost 3 years old (it doesn't even understand "-d")--not that I expected a sun3 kmsg to work with coons, which is a ds5000, anyway. I eventually L1-A'd allspice and rebooted. (By the way, would anyone complain if I zap ~sprite/bin? It always points to cmds.sun3, which isn't too helpful if you're on, say, dill or shallot.) (5) The instructions for taking a core dump of allspice recommend using a partition with at least 40MB free. There are only two such partitions on ginger: /home/ginger/users and /home/ginger/pnh. I think it would be very bad manners to start taking space on pnh, and I have doubts about /home/ginger/users. What's the story on our disk storage on ginger? Are we planning to get more space? (Or is it time to do some housecleaning?) mike Log-Number: 30817 Subject: more notes on mail problems Date: Fri, 22 Mar 91 23:11:05 PST From: Mike Kupfer <kupfer> I don't know if this is relevant to the mail lossage we've been experiencing lately, but the past couple days I've noticed a lot of orphan sendmail lock files which I've had to remove by hand. There have also been a bunch of "no control file" entries in the queue. mike Log-Number: 30818 Date: Sat, 23 Mar 91 14:26:28 PST From: shirriff (Ken Shirriff) Subject: Violence died Violence suddenly locked up while I was using it. I couldn't do L1-A. It is running the ds3100 1.084 kernel. 
Ken Log-Number: 30821 Subject: assault reboot Date: Tue, 26 Mar 91 12:15:29 PST From: Mike Kupfer <kupfer> Assault died around 0100 this morning with "Fatal Error: Mem_Free: storage block already free". There were a dozen or two messages earlier on the console saying Corrupted directory? File ID <25, 0, n> dirBlockNum 0, blockOffset 512 where n had the values 6497, 61097, 42216, and 21040. Is /pcs/scratch giving us a hard time again? One more oddity: after assault went into the debugger, it was still sort of talking on the net. There was a reboot message for roar at around 0900 on assault's console, and when I came in sage was repeatedly trying to go through recovery with assault. mike Log-Number: 30822 Date: Tue, 26 Mar 91 16:38:14 PST From: shirriff (Ken Shirriff) Subject: Allspice crash Allspice crashed yesterday (or maybe the day before) with: FsReopenHandle: file "make 4353" client 53 has dirty blocks but client 62 is using MachHandleTrap: entering debugger Log-Number: 30823 Date: Wed, 27 Mar 91 13:59:59 PST From: shirriff (Ken Shirriff) Subject: Allspice crash Last night the talkd went berserk and started printing zillions of "stale remote handle" errors. I tried L1-J to stop the syslog, but I couldn't get anything through on the keyboard; it just said "serial overrun on serialB" or whatever. I tried getting it into the debugger but it didn't work; even L1-A wouldn't work. I think talkd has done this before. Ken Log-Number: 30824 Date: Thu, 28 Mar 91 10:04:53 PST From: ouster@dill (John Ousterhout) Subject: Allspice crash Allspice just crashed. First it printed a message about "sanity check failed on outgoing packet", then it seemed to be entering the debugger, then it printed another message about another RpcSanityCheck (on an incoming packet?) "packet too short, 98 < 415131", then it entered the debugger again. I have to leave to give a talk at Adobe, so I couldn't do any debugging. I've started Allspice rebooting. 
-John- Log-Number: 30825 Subject: Re: Allspice crash Date: Thu, 28 Mar 91 12:13:29 PST From: Mike Kupfer <kupfer> One bit of additional information: Jim left a note on my keyboard that included the client and server ID's, which were garbage: 10816319 and 2145348069. mike Log-Number: 30832 Subject: raid1 deadlock Date: Mon, 01 Apr 91 14:49:06 PST From: Mike Kupfer <kupfer> raid1 apparently deadlocked itself on /r1, so I rebooted it. Here are some excerpts from my debugging session: 14d01 0 [0, 0] [17,600000] f6dbf4ac waiting Rpc_Server 34d0d 0 [0, 0] [9,260000] f6dbf4ac waiting Rpc_Server 24d0e 0 [0, 0] [37,320000] f6dbf4ac waiting Rpc_Server 34d0f 0 [0, 0] [7,620000] f6dbf4ac waiting Rpc_Server 14d10 0 [0, 0] [33,240000] f6dbf4ac waiting Rpc_Server 14d17 0 [0, 0] [5,540000] f6dbf4ac waiting Rpc_Server 64d18 0 [0, 60000] [0, 20000] f6dbf4ac waiting csh 14d1f 0 [0, 0] [5,560000] f6dbf4ac waiting Rpc_Server [...] (gdb) up 3 Reading in symbols for fsutilHandle.c...done. #3 0xf605df78 in Fsutil_HandleFetch (fileIDPtr=(struct Fs_FileID *) 0xf6bf66a8) (fsutilHandle.c line 578) 578 (void) Sync_Wait(&hdrPtr->unlocked, FALSE); (gdb) print *hdrPtr $2 = {fileID = {type = 1, serverID = 77, major = 1, minor = 2}, flags = 3, unlocked = {waiting = 1}, refCount = 14, name = 0xf6789da8 "/r1", lruLinks = {prevPtr = 0xf74541f8, nextPtr = 0xf6de1150}, lockProcID = -165697440} (gdb) print /x hdrPtr.lockProcID $4 = 0xf61fa860 (gdb) print &hdrPtr.unlocked $5 = (struct Sync_Condition *) 0xf6dbf4ac Note the bogus lockProcID. mike Log-Number: 30834 Subject: syslog times are wrong again Date: Tue, 02 Apr 91 12:45:08 PST From: Mike Kupfer <kupfer> The times in the syslog have been off by an hour since the start of the month. I assume this is because the kernel still doesn't understand the daylight savings rules. When does daylight savings time start this year--the end of the month? mike Log-Number: 30835 From: jhh@sprite.Berkeley.EDU (John H. 
Hartman) Date: Tue, 2 Apr 1991 13:13:49 PST Subject: Re: syslog times are wrong again The kernel assumes that daylight savings time starts and ends at the beginning of the month. Daylight savings time starts for real on the 7th, so we only have to put up with it for the rest of the week. John Log-Number: 30836 Subject: yet another raid1 deadlock Date: Tue, 02 Apr 91 13:15:00 PST From: Mike Kupfer <kupfer> Sage hung up on raid1 (which was running 1.084) again. As near as I can tell, there's a memory smash happening. Consider the domainPtr from Fsio_FileReopen (gdb) print *domainPtr $10 = {domainPrefix = 0x1, domainNumber = 77, flags = 1, refCount = 170178, condition = {waiting = -1}, backendPtr = 0xf6058668, domainOpsPtr = 0x0, clientData = 0xf60d16c8} and the hdrPtr from Fsutil_HandleFetch (gdb) print *hdrPtr $11 = {fileID = {type = 1, serverID = 77, major = 1, minor = 170178}, flags = 3, unlocked = {waiting = 1}, refCount = 1, name = 0xffffffff, lruLinks = {prevPtr = 0xf69fa9c8, nextPtr = 0xf72a1498}, lockProcID = -165697440} (gdb) print /x hdrPtr.lockProcID $12 = 0xf61fa860 Here's the stack backtrace of my hung RPC server on raid1. 
(gdb) bt #0 0xf600c6f0 in Mach_ContextSwitch () #1 0xf60b80a0 in SyncEventWaitInt (event=4146719212, wakeIfSignal=0) (syncLock.c line 675) #2 0xf60b6ae8 in Sync_SlowWait ( conditionPtr=(struct Sync_Condition *) 0xf729e9ec, lockPtr=(struct Sync_KernelLock *) 0xf60ea180, wakeIfSignal=0) (syncLock.c line 284) #3 0xf605df78 in Fsutil_HandleFetch ( fileIDPtr=(struct Fs_FileID *) 0xf6686678) (fsutilHandle.c line 578) #4 0xf605d870 in Fsutil_HandleInstall ( fileIDPtr=(struct Fs_FileID *) 0xf6686678, size=316, name=(char *) 0xffffffff, cantBlock=0, hdrPtrPtr=(struct Fs_HandleHeader **) 0xf806bcdc) (fsutilHandle.c line 317) #5 0xf60438ec in Fsio_LocalFileHandleInit ( fileIDPtr=(struct Fs_FileID *) 0xf6686678, name=(char *) 0xffffffff, descPtr=(struct Fsdm_FileDescriptor *) 0xffffffff, cantBlock=0, newHandlePtrPtr=(struct Fsio_FileIOHandle **) 0xf806bcdc) (fsioFile.c line 80) #6 0xf604405c in Fsio_FileReopen (clientID=33, inData=(ClientData) 0xf6686678, outSizePtr=(ClientData) 0xf806bddc, outDataPtr=(ClientData *) 0xf806bdd8) (fsioFile.c line 431) #7 0xf6058680 in Fsrmt_RpcReopen (srvToken=(ClientData) 0xf6685608, clientID=33, storagePtr=(struct Rpc_Storage *) 0xf806bdc8) (fsrmtDomain.c line 585) #8 0xf60ade88 in Rpc_Server () (rpcServer.c line 255) #9 0xf60b3310 in Sched_StartKernProc (...) (...) Log-Number: 30837 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Tue, 2 Apr 1991 13:19:57 PST Subject: Re: yet another raid1 deadlock Some of the stuff in the domain header does indeed look screwed up. This sort of deadlock has happened before, however, and may not be due to overwriting memory. Should raid1 deadlock again use the lockProcID field (this is a pointer to a Proc_ControlBlock) to backtrace the stack of the process that is holding the lock. That should give us some more clues. 
John Log-Number: 30838 Subject: Re: yet another raid1 deadlock Date: Tue, 02 Apr 91 14:07:05 PST From: Mike Kupfer <kupfer> > Should raid1 deadlock again use the lockProcID > field (this is a pointer to a Proc_ControlBlock) Is there some reason why lockProcID is declared to be an int in fs.h? mike Log-Number: 30839 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Tue, 2 Apr 1991 14:09:53 PST Subject: Re: yet another raid1 deadlock I assume it is because of problems with the include files, i.e., you get a circularity in the include files if you try to define it as such. John Log-Number: 30840 Subject: -lc_g not working w/ ds5000? Date: Tue, 02 Apr 91 15:43:52 PST From: Mike Kupfer <kupfer> I tried to rebuild "migrate" (formerly "mig") for the ds3100, using a ds5000. Appended is the make log. It worked fine when I did a plain "make" on a ds3100. Is this a configuration problem, or a bug in the tools, or what? mike -- covet% cd /sprite/src/cmds/migrate covet% make TM=ds3100 --- ds3100.md/mig.o --- rm -f ds3100.md/mig.o cc -g3 -O -Dds3100 -Dsprite -Uultrix -I. -Ids3100.md -I/sprite/lib/include -I/sprite/lib/include/ds3100.md -c mig.c -o ds3100.md/mig.o --- ds3100.md/migrate --- rm -f ds3100.md/migrate cc -g3 -O -Dds3100 -Dsprite -Uultrix -I. -Ids3100.md -I/sprite/lib/include -I/sprite/lib/include/ds3100.md -o ds3100.md/migrate ds3100.md/mig.o -lc_g ld: Can't locate file for: -lc_g with -B1.31 ld: Usage: ld [options] file [...] *** Error code 1 make: 1 error Log-Number: 30841 Subject: rcs messup? Date: Tue, 02 Apr 91 16:15:20 PST From: Mary Baker <mgbaker> The delta text for the rcs'd copy of fsrmtFile.c is missing in fsrmtFile.c,v. This means you can't do an rcsdiff on the file. Has anyone heard of this kind of rcs messup before? Mary Log-Number: 30842 Subject: Re: rcs messup? Date: Tue, 02 Apr 91 16:33:20 PST From: Mike Kupfer <kupfer> The RCS file for /etc/spritehosts got clobbered at some time in the past, and I had to flush a bunch of old RCS revisions.
I blamed it on file system problems. mike Log-Number: 30843 Date: Tue, 2 Apr 91 18:15:21 PST From: elm (ethan miller) Subject: problems with # of blocks in ls/du? For some reason, a file of a constant size gives a different # of blocks on /scratch1 and /sprite/src/kernel. The files on /scratch1 are consistently larger than the ones on /sprite/src/kernel, despite their identical file sizes (as listed with ls -l). Is this supposed to happen? (I noticed it when transferring my kernel src from /scratch1 to /sprite/src/kernel after the new disk was created.) I'll leave copies of /sprite/src/kernel/net.elm/sun4.md/netHppi.c.mod and /scratch1/elm/src/.... around in case you need to check them. thanks ethan Log-Number: 30846 Date: Wed, 3 Apr 91 10:32:42 PST From: jhh@dill (John H. Hartman) Subject: fsattach and lfs Lfs file systems are currently attached in /hosts/$HOST/bootcmds, unlike normal file systems which are listed in /hosts/$HOST/mount. If you goof up and put an lfs in the mount table the machine will die with a hard error during fscheck. Fsattach should be modified to handle lfs file systems. John Log-Number: 30847 Date: Wed, 3 Apr 91 10:30:58 PST From: jhh@dill (John H. Hartman) Subject: new kernel problems Anise and allspice have been unsuccessful in running the new kernel (1.087). They both get a negative reference count releasing a handle in Fsio_StreamClientKill. I am going to reboot both of them with the old kernel. Here is a stack backtrace from allspice's last crash. The other backtraces looked similar. John #0 panic (__builtin_va_alist=-167369877) (sysPrintf.c line 220) #1 0xf60624b0 in Fsutil_HandleReleaseHdr (...) (...) #2 0xf604c110 in Fsio_StreamClientKill (...) (...) #3 0xf60642a4 in Fsutil_RemoveClient (...) (...) #4 0xf606422c in Fsutil_ClientCrashed (...) (...) #5 0xf60a98a0 in CrashCallBacks (...) (...) #6 0xf60a3d74 in Proc_ServerProc (...) (...) #7 0xf60b7020 in Sched_StartKernProc (...) (...) #8 0xf60b6fa0 in Sched_StartKernProc (...) (...) 
Log-Number: 30848 Date: Thu, 4 Apr 91 13:13:52 PST From: ouster (John Ousterhout) Subject: Reset error? I'm running version 1.087 of the kernel (new). A few minutes ago, the following two lines appeared in my console window: LE ethernet: Memory underflow error. Deferring reset. Then everything stopped. My machine wouldn't do anything that required communication with the outside world. After about a minute I typed control-N, at which point everything cleared up and tyranny went through recovery with the various file servers. Does this mean there's a bug in the new deferred reset code for the network? -John- Log-Number: 30849 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Thu, 4 Apr 1991 13:35:18 PST Subject: Re: Reset error? The new kernel should contain a fix for this bug. John Log-Number: 30850 Date: Thu, 4 Apr 91 13:38:06 PST From: ouster (John Ousterhout) Subject: Migration problem with 1.087? In trying to compile a few minutes ago I got the following error messages: <rmt notify> 4/4/91 14:35:39 larceny (73) RPC timed-out Warning: received status 30002 notifying process. <mig command> 4/4/91 14:36:25 larceny (73) RPC timed-out Could there be some sort of migration version number problem with the new kernel? -John- Log-Number: 30851 Subject: Re: Migration problem with 1.087? Date: Thu, 04 Apr 91 13:43:36 PST From: Mary Baker <mgbaker> Larceny is in an infinite recovery loop with allspice. We've seen this happen before, and I haven't yet figured out how to fix it for sure. John H. and I are working on putting out a new "new" kernel since there were bugs in 1.087. We may not be able to fix the infinite recovery loop in time. I'm hoping it's fixed by one of my other fixes, but I'm not sure yet. Mary Log-Number: 30852 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Fri, 5 Apr 1991 10:05:40 PST Subject: lfs bug Allspice crashed yesterday with an lfs bug in the new /sprite/src/kernel. That filesystem is unmounted until further notice. 
The message on the console was Dirtyblocks (2) after a checkpoint Fatal Error: Discriptor map foulup, can't find file 35213 at 6123 Allspice crashed a second time with the same bug, although the numbers were different. Running lfscheck on the filesystem produces lots of errors. I've dd'ed the filesystem onto the other new disk so that Mendel may look at it when he gets back. In the meantime I'm trying to get the first copy into decent shape. Wish me luck. John Log-Number: 30856 Date: Sat, 06 Apr 91 00:47:11 PST From: Mary Baker <mgbaker> Allspice crashed tonight while doing the restore of /sprite/src/kernel. It froze up totally and I had to reboot it. I got the pc of where it was stuck and will try to figure out what was happening. Mary Log-Number: 30857 From: mendel (Mendel Rosenblum) Subject: Network hangup on new kernel. Date: Sat, 06 Apr 91 11:01:57 PST I got the following message in my syslog on jaywalk: LE ethernet: Memory underflow error. Deferring reset. and the network appeared to stop working. I typed l1-n and everything cleared up. The kernel was: SPRITE VERSION 1.087 (sun4c) (2 Apr 91 14:22:57) Mendel Log-Number: 30858 Date: Sat, 6 Apr 91 11:46:57 PST From: gibson (Garth Gibson) Subject: May be an X bug or an xproof bug mustard 14> xproof -display apathy:0 -geometry +449+0 -scale 11 Introduction.n puts an xproof window scaled up by 10% in the right half of apathy's screen mustard 15> xproof -display apathy:0 -name XprfIntroduction.n Introduction.n puts an unscaled, free floating xproof window on apathy's screen and puts a useful title in my icon manager entry but mustard 16> xproof -display apathy:0 -name XprfIntroduction.n -geometry +449+0 -scale 11 Introduction.n does not do what I expect - the geometry and scale args are ignored is there some "user friendly" function deep in the bowels of X customization files that makes the geometry and scale parameters dependent on the default values of the name argument?
Am I unusual, or do others find X customization difficult to learn, difficult to debug, and horrendous to encounter on another person's screen ? Log-Number: 30866 Date: Sat, 6 Apr 91 22:39:35 PST From: gibson@apathy.Berkeley.EDU (Garth Gibson) Subject: Re: May be an X bug or an xproof bug You win the prize. It is the .n extension on the -name argument. Gross. garth > From kupfer@sprite.Berkeley.EDU Sat Apr 6 21:53:37 1991 > Cc: bugs@sprite.Berkeley.EDU, tve@sprite.Berkeley.EDU > Subject: Re: May be an X bug or an xproof bug > > Try "-name XprfIntroduction" instead of "-name XprfIntroduction.n". I > suspect there's some bug or misfeature in the Xt argument parsing that > gets triggered by the period in the name. (I suspect it's a bug but > would want to reread the "resource specification" documentation before > submitting a bug report to MIT.) > > mike Log-Number: 30861 Date: Sat, 6 Apr 91 18:13:22 PST From: Dean Long <dlong@midgard.ucsc.edu> Subject: pmake pmake gets confused if a blank line in a Makefile has spaces in it. dl Log-Number: 30864 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Sat, 6 Apr 1991 18:24:35 PST Subject: allspice crashed Allspice crashed with a level 15 interrupt before we could shut it down cleanly. On reboot RPCs concerning /etc/spritehosts hung to a bunch of clients. We pulled out the network interface, then sat through a timeout on each and every client. Each one concerned /etc/spritehosts. After we plugged the network back in things went fine for a while, but then it all wedged up again. We sync'ed the disks in preparation for a reboot, but this cleared things up. There isn't much to be learned from all of this, except that it is possible for our system to get into very bad situations during recovery. John Log-Number: 30867 Date: Mon, 8 Apr 91 08:42:55 PDT From: ouster (John Ousterhout) Subject: Recovery problem in "new" kernel? 
When I came in today tyranny was in another infinite recovery loop, printing messages like the following: <consist done> 4/8/91 6:50:29 allspice (14) RPC timed-out Got error (30002) from consist reply on <10,90621> Although I'm not certain, I think this may have been happening ever since the Allspice reboot late Saturday afternoon. If so, then I think that the "new" kernel is 0-for-2 on recoveries after Allspice crashes (it's possible that there was a time when it recovered correctly, but I don't recall it). Has anyone else been having recovery problems with "new"? If so, then perhaps there is a new bug in the new kernel. -John- Log-Number: 30868 Subject: mail troubles Date: Mon, 08 Apr 91 11:27:52 PDT From: Mary Baker <mgbaker> On Sunday I was unable to access my mail spool file, even though I was told I had new mail when I logged in. Today I can read mail, but there are no messages from yesterday, so I'm concerned I lost some amount of mail. Could somebody please remind me where the sprite log is so that I can at least check it for any sprite messages from yesterday? Mary Log-Number: 30869 Subject: Yup - a lot of mail lost Date: Mon, 08 Apr 91 11:35:26 PDT From: Mary Baker <mgbaker> Looking at the sprite log, it appears I didn't get most of Saturday's mail either. I hope nobody sent me anything I needed to see. Mary Log-Number: 30870 Subject: Re: Yup - a lot of mail lost Date: Mon, 08 Apr 91 11:55:16 PDT From: Mike Kupfer <kupfer> Also, checking the Sprite log shows that John O.'s response about the network coprocessor was zapped--the log contains only the message header. mike Log-Number: 30871 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Mon, 8 Apr 1991 13:55:14 PDT Subject: O_NDELAY and tar.gnu I should have mentioned that in order to get the restore of /sprite/src/kernel to work I had to modify "tar.gnu". First of all, why is there a "tar" and a "tar.gnu"? I thought we had agreed to get rid of the former and rename the latter. 
Can someone who takes better notes at the Sprite meetings verify this? It turns out that tar.gnu would open files with the O_NDELAY flag. On Sprite this means that a write to a full cache returns EWOULDBLOCK (this is also a bug). Tar.gnu would then stop writing the file. All of our restored kernels were about 2 MB long, rather than 8-10 MB. I installed a new tar.gnu that doesn't have this problem. Note that Mendel previously fixed this in tar, but didn't know about tar.gnu. John Log-Number: 30872 Subject: more on raid1's hanging Date: Mon, 08 Apr 91 15:30:00 PDT From: Mike Kupfer <kupfer> Sage was getting stuck accessing /r1 again. When I put raid1 into the debugger, I found that one process was waiting to get the handle for /r1/mach3.0/ux. The process that was holding the lock for "ux" was in the block I/O code, waiting for a read to complete. (Appended below are excerpts from the gdb session.) I couldn't think of anything else to do except reboot raid1. mike -- (gdb) p Proc_Dump() Reading in symbols for procMisc.c...done. ID wtd user kernel event state name 10000 0 [0, 0] [0,580000] f61fa290 waiting 14d01 0 [0, 0] [5,960000] f8005570 waiting Rpc_Server [...] 14d10 0 [0, 0] [4,760000] f724a49c waiting Rpc_Server [...] 14d17 0 [0, 0] [5, 60000] f724a49c waiting Rpc_Server [...] 14d19 0 [0, 0] [5,240000] f724a49c waiting Rpc_Server [...] 64d20 0 [0, 0] [1, 60000] f724a49c waiting Rpc_Server [...] 34d3d 0 [0, 0] [0,340000] f724a49c waiting Rpc_Server [...] (gdb) pid 0x19 Attaching process entry: 0x19 Kernel returns with signal (16) Interrupt Trap Program received signal 2, Illegal Instruction Fault #0 0xf600c6f0 in Mach_ContextSwitch () (gdb) bt #0 0xf600c6f0 in Mach_ContextSwitch () #1 0xf60b80a0 in SyncEventWaitInt (...) (...) #2 0xf60b6ae8 in Sync_SlowWait (...) (...) #3 0xf605df78 in Fsutil_HandleFetch (...) (...) #4 0xf605d870 in Fsutil_HandleInstall (...) (...) #5 0xf60438ec in Fsio_LocalFileHandleInit (...) (...) #6 0xf604405c in Fsio_FileReopen (...) (...) 
#7 0xf6058680 in Fsrmt_RpcReopen (...) (...) #8 0xf60ade88 in Rpc_Server (...) (...) #9 0xf60b3310 in Sched_StartKernProc (...) (...) #10 0xf60b3290 in Sched_StartKernProc (...) (...) ERROR: invalid read address 0xa096dac (gdb) up 3 Reading in symbols for fsutilHandle.c...done. #3 0xf605df78 in Fsutil_HandleFetch ( fileIDPtr=(struct Fs_FileID *) 0xf6c26758) (fsutilHandle.c line 578) 578 (void) Sync_Wait(&hdrPtr->unlocked, FALSE); (gdb) print *hdrPtr $3 = {fileID = {type = 1, serverID = 77, major = 1, minor = 90833}, flags = 3, unlocked = {waiting = 1}, refCount = 2, name = 0xf7283e28 "ux", lruLinks = {prevPtr = 0xf727f898, nextPtr = 0xf7262a78}, lockProcID = -165698184} (gdb) print (Proc_ControlBlock *)hdrPtr.lockProcID $4 = (struct Proc_ControlBlock *) 0xf61fa578 (gdb) print *$4 $5 = {links = {prevPtr = 0xf61d9b70, nextPtr = 0xf61fa860}, processor = 0, state = PROC_WAITING, genFlags = 1, syncFlags = 0, schedFlags = 0, exitFlags = 0, childListHdr = {prevPtr = 0xf61fa598, nextPtr = 0xf61fa598}, childList = 0xf61fa598, siblingElement = {links = {prevPtr = 0xf61fa2b0, nextPtr = 0xf61fa88c}, procPtr = 0xf61fa578}, familyElement = {links = {prevPtr = 0xffffffff, nextPtr = 0xffffffff}, procPtr = 0xf61fa578}, processID = 85249, parentID = 65536, familyID = -1, userID = 0, effectiveUserID = 0, event = -134195856, eventHashChain = {links = {prevPtr = 0xf6102660, nextPtr = 0xf6102660}, procPtr = 0xf61fa578}, waitCondition = {waiting = 0}, lockedCondition = {waiting = 0}, waitToken = 0, billingRate = 2, recentUsage = 0, weightedUsage = 0, unweightedUsage = 0, kernelCpuUsage = {ticks = {seconds = 5, microseconds = 960000}, time = {seconds = 5, microseconds = 960000}}, userCpuUsage = {ticks = {seconds = 0, microseconds = 0}, time = {seconds = 0, microseconds = 0}}, childKernelCpuUsage = {ticks = {seconds = 0, microseconds = 0}, time = {seconds = 0, microseconds = 0}}, childUserCpuUsage = {ticks = {seconds = 0, microseconds = 0}, time = {seconds = 0, microseconds = 0}}, 
numQuantumEnds = 0, numWaitEvents = 7852, schedQuantumTicks = 5, machStatePtr = 0xf662ce48, vmPtr = 0xf62ccd90, fsPtr = 0xf6627988, termReason = 0, termStatus = 0, termCode = 0, sigHoldMask = 256, sigPendingMask = 0, sigActions = {0, 2, 2, 2, 2, 1, 1, 2, 2, 5, 5, 6, 1, 6, 1, 1, 0, 0, 1, 6, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, sigMasks = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, sigCodes = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, sigFlags = 0, oldSigHoldMask = 0, sigAddr = 0, timerArray = 0xffffffff, peerHostID = -1, peerProcessID = 4294967295, rpcClientProcess = 0xffffffff, environPtr = 0xf62ccf38, argString = 0xf6434f50 "Rpc_Server", lockInfo = {value = 0, name = 0xf60a0928 "Proc:perPCBlock", holderPC = 0xf609ad4c "@", holderPCBPtr = 0xf6c14af0 "\366\035\233p\366\035\233p"}, kcallTable = 0xf61dfda8, specialHandling = 0, Prof_Buffer = 0xffffffff, Prof_BufferSize = 0, Prof_Offset = 0, Prof_Scale = 0, Prof_PC = 0, remoteExecBuffer = 0xffffffff "", migCmdBuffer = 0xffffffff "", migCmdBufSize = 0, migFlags = 0, preEvictionUsage = {ticks = {seconds = 0, microseconds = 0}, time = {seconds = 0, microseconds = 0}}, unixErrno = 0, unixProgress = 0, extraField = {0, 0, 33, 33, 3, 0, 1, 0, 0, 0}} (gdb) print /x 85249 $6 = 0x00014d01 (gdb) pid 1 Attaching process entry: 0x1 Kernel returns with signal (16) Interrupt Trap Program received signal 2, Illegal Instruction Fault #0 0xf600c6f0 in Mach_ContextSwitch () (gdb) bt #0 0xf600c6f0 in Mach_ContextSwitch () #1 0xf60b80a0 in SyncEventWaitInt (...) (...) #2 0xf60b703c in Sync_SlowMasterWait (...) (...) #3 0xf6016d1c in Dev_BlockDeviceIOSync (...) (...) #4 0xf608faa0 in OfsDeviceBlockIO (...) (...) #5 0xf608e774 in Ofs_FileDescFetch (...) (...) #6 0xf60387e8 in Fsdm_FileDescFetch (...) (...) #7 0xf60439ac in Fsio_LocalFileHandleInit (...) (...) #8 0xf604b3a0 in FindComponent (...) (...) 
#9 0xf604a67c in FslclLookup (...) (...) #10 0xf604989c in FslclGetAttrPath (...) (...) #11 0xf6056670 in Fsrmt_RpcGetAttrPath (...) (...) #12 0xf60ade88 in Rpc_Server (...) (...) #13 0xf60b3310 in Sched_StartKernProc (...) (...) (gdb) up 3 Reading in symbols for devBlockDevice.c...done. #3 0xf6016d1c in Dev_BlockDeviceIOSync ( blockDevHandlePtr=(struct DevBlockDeviceHandle *) 0xf60de688, requestPtr=(struct DevBlockDeviceRequest *) 0xf80055f8, amountTransferredPtr=(ClientData) 0xf80055f4) (devBlockDevice.c line 268) 268 Sync_MasterWait((&syncCmdData.wait),(&syncCmdData.mutex),FALSE); (gdb) print syncCmdData $7 = {mutex = {value = 0, name = 0xf6016b84 "BlockSyncCmdMutex", holderPC = 0xf60b71f4 "\177\375U\013\001", holderPCBPtr = 0xf61fa578 "\366\035\233p\366\037\250`"}, wait = {waiting = 1}, done = 0, amountTransferred = 0, status = 0} (gdb) frame 10 Reading in symbols for fslclDomain.c...done. #10 0xf604989c in FslclGetAttrPath ( prefixHandlePtr=(struct Fs_HandleHeader *) 0xf6c2b1b8, relativeName=(char *) 0xf6628e48 "mach3.0/ux/server", argsPtr=(char *) 0xf6628a48 "", resultsPtr=(char *) 0xf8005d48 "\366c}\230\366c}\250", newNameInfoPtrPtr=(struct Fs_RedirectInfo **) 0xf8005d3c) (fslclDomain.c line 248) 248 &handlePtr, newNameInfoPtrPtr); (gdb) print relativeName $8 = (char *) 0xf6628e48 "mach3.0/ux/server" (gdb) frame 3 #3 0xf6016d1c in Dev_BlockDeviceIOSync ( blockDevHandlePtr=(struct DevBlockDeviceHandle *) 0xf60de688, requestPtr=(struct DevBlockDeviceRequest *) 0xf80055f8, amountTransferredPtr=(ClientData) 0xf80055f4) (devBlockDevice.c line 268) 268 Sync_MasterWait((&syncCmdData.wait),(&syncCmdData.mutex),FALSE); (gdb) print *blockDevHandlePtr $9 = {blockIOProc = 0x9de3bf90, IOControlProc = 0x9a100019, releaseProc = 0xd806201c, minTransferUnit = -771538944, maxTransferSize = -2136842239, clientData = 0x280000b} (gdb) x/i 0x9de3bf90 0x9de3bf90: ERROR: invalid read address 0x9de3bf90 (gdb) x/i 0xd806201c 0xd806201c: ERROR: invalid read address 0xd806201c (gdb) 
print &Dev_BlockDeviceIOSync $10 = (int (*)()) 0xf6016b98 (gdb) x/i 0xf6016b98 0xf6016b98 <Dev_BlockDeviceIOSync>: save sp,0xffffff70,sp (gdb) print *requestPtr $11 = {operation = 1, startAddress = 29765632, startAddrHigh = 0, bufferLen = 4096, buffer = 0xfa071000 "joel/162/ex61\".\nCan't read directory \"./joel/162/ex62\".\nCan't read directory \"./joel/162/ex63\".\nCan't read directory \"./joel/162/ex64\".\n0 errors found\nChecksum completed at Tue Dec 25 04:57:04 PST 199"..., doneProc = 0xf60169e8, clientData = 0xf8005560, ctrlData = {0, 0, 0, 134217728, -1, 0, -166819920, 0, 0, -167620776, 12, 0, 0, -134195584, 0, -160436504}} Log-Number: 30874 Subject: more on raid1's hanging Date: Wed, 10 Apr 91 22:00:20 PDT From: Mike Kupfer <kupfer> I once again managed to get raid1 stuck on /r1/mach3.0/ux. As with the previous message I sent out on this problem, the process that has the handle locked is doing a get-attr-path on /r1/mach3.0/ux/server, waiting in Dev_BlockDeviceIOSync for a read to complete, and its blockDevHandlePtr has garbage function pointers. mike Log-Number: 30886 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Mon, 15 Apr 1991 16:57:30 PDT Subject: lost IO to cache block Raid1 has been hanging up lately due to an IO to a cache block that never completes. A call is made to RaidBlockIOProc, but the callback is never done. Mike Kupfer tells me that the problem is very repeatable, and it always involves the same directory/file. I seem to remember Mendel saying something about this before, but I can't find a reference in the sprite log. Perhaps this is a hardware problem with the Jaguar boards? If so, shouldn't we have some sort of timeout mechanism? Perhaps we should wait until Mendel gets back since he probably understands what's going on. 
John Log-Number: 30873 Date: Tue, 9 Apr 91 10:39:20 PDT From: mottsmth (Jim Mott-Smith) Subject: Allspice crash When I came in this morning at about 9 Allspice was dead in the water with the following on the console: inetd[50e68] time/tcp accept: invalid arg Dirty blocks(2) after a checkpoint Fatal error: VmRawAlloc: Out of Memory Entering debugger with a Interrupt trap (16) exception at PC 0xf60be9c4 The entering debugger message was repeated a dozen times. No console break-commands would do anything. Kgcore from Ginger would not work. It kept saying 'timing out resending'. So I rebooted Allspice according to the directions taped to the console. -- Jim M-S Log-Number: 30875 Subject: Sync_Wait can return wrong value Date: Thu, 11 Apr 91 15:23:52 PDT From: Mike Kupfer <kupfer> There's a fair amount of kernel code that checks whether a wait on a condition variable was interrupted by a signal. Unfortunately, the main "wait" primitive, SyncEventWaitInt, only checks for signals before the context switch; it doesn't check after it wakes up again and can therefore return the wrong value. Also, a minor bug: the comment header for Sync_Wait is wrong, since Sync_Wait is in fact supposed to return a meaningful value. mike Log-Number: 30877 Date: Fri, 12 Apr 91 09:11:00 PDT From: bmiller (Bob Miller) Subject: allspice Allspice was "down" when I came in this morning. The console was scrolling continuously with "stale remote file handle" messages. I reset it and rebooted. Bob Log-Number: 30878 Date: Fri, 12 Apr 91 11:47:06 PDT From: pmchen (Peter M. Chen) Subject: messed up mail My mail file had part of a postscript file in the middle. I moved /usr/spool/mail/pmchen to ~pmchen/tmp/badmail. Pete Log-Number: 30879 From: jhh@sprite.Berkeley.EDU (John H. 
Hartman) Date: Fri, 12 Apr 1991 12:56:12 PDT Subject: pmake problems If I try to compile for the sun3 on a sun4c I get the following message: tyranny<jhh 20> pmake sun3 --- .BEGIN --- Sorry, the target machine (sun4) isn't in the list of allowed machines (sun3). exit 1 *** Error code 1 I'm trying to compile a command. John Log-Number: 30882 Subject: problem with signals on sun3 w/ new kernel Date: Fri, 12 Apr 91 16:38:11 PDT From: Mike Kupfer <kupfer> The binary compatibility code for sun3s doesn't handle signals quite right. I can frequently (though not always) crash Emacs by running the directory browser "dired". I get syslog messages like Unix signal 20(17) to 5301e sp=9fd9fc, pc=5312c, ps=0, len=36 to 9fde9d8, exPc=5454e MachTrap: Bus error in user proc 5301E, PC=5316a, addr=14 BR Reg 80 signal 20 is SIGCHLD, and process 5301e is Emacs. If I dired a directory and don't see the "Unix signal" message, Emacs doesn't die. mike Log-Number: 30883 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Fri, 12 Apr 1991 17:57:28 PDT Subject: new disk problem fixed I've tentatively fixed the problem with the big disks. There are enough sectors on the disk to overflow the offset field in the standard SCSI read/write commands. I had to modify the SCSI device driver to use the extended read/write commands if the offset is too large. The previous driver never checked that the offset fit, hence it would wrap back to the beginning of the disk once it overflowed. My changes are in the uninstalled dev module. I have the disks attached to jaywalk where I'm running some testing programs to make sure it all works. john Log-Number: 30884 Subject: tokens not reported by ID database Date: Sun, 14 Apr 91 13:01:19 PDT From: Mike Kupfer <kupfer> Sig_CheckForKill is called in sun4.md/machTrap.s. "lid" lists it, but "gid" does not. 
mike Log-Number: 30887 Date: Mon, 15 Apr 91 22:02:22 PDT From: tve (Thorsten von Eicken) Subject: executable on nfs-mounted disks Why can't I execute a file on an nfs-mounted disk? Can this restriction be removed? Thanks, TvE Log-Number: 30888 Subject: g++ Date: Tue, 16 Apr 91 11:38:45 PDT From: Mike Kupfer <kupfer> [Sorry, I included Michial's message in my reply so that it would get logged, then I went and cc'd the wrong list.] ------- Forwarded Message Date: Mon, 15 Apr 91 17:21:15 PDT >From: gunter (Michial Gunter) To: root Subject: g++ I am new to Sprite and don't really know how much effort is made to support various things. In particular, is there any support for g++? If so: g++ -g -Wall -msun4 -c ch_error.cc g++: installation problem, cannot exec g++1.sparc: no such file or directory and g++ -g -Wall -mds3100 -c main.cc /sprite/lib/include/g++/stdarg.h:39: /sprite/lib/include/ds3100.md/stdarg.h: no such file or directory thanks, mike ------- End of Forwarded Message Log-Number: 30889 Subject: proc flag used as state Date: Tue, 16 Apr 91 14:00:00 PDT From: Mike Kupfer <kupfer> syncLock.c has some code switch (procPtr->state) { case PROC_WAITING: break; case PROC_MIGRATING: panic("Can't handle waking up a migrating proc.\n"); break; However, PROC_MIGRATING is a flag, not a state value. I assume I can just flush this case of the switch...? mike Log-Number: 30890 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Tue, 16 Apr 1991 18:04:59 PDT Subject: bcopy broken Bcopy doesn't seem to work correctly on the ds5000. The kernel does a bcopy very early in the boot. This bcopy only copied the first three bytes correctly, the rest were garbage. I replaced the symbolic link in /sprite/src/kernel/libc/ds5000.md with a copy of bcopy.c, then removed the stuff about CheckAccessible. Voila it works. I'm going to leave it that way for now since the ds5000 doesn't have any installed sources. 
John Log-Number: 30891 Date: Wed, 17 Apr 91 08:18:14 PDT From: tve (Thorsten von Eicken) Subject: lfs disk full still at 75% !!! We need more than 75% of our 1 disks! Mendel, when can you please change that? TvE Log-Number: 30892 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Wed, 17 Apr 1991 11:00:57 PDT Subject: lfs died Lfs died when the disk filled up and it couldn't modify the descriptor map. I realize that Sprite in general isn't very robust when the disk fills up, but maybe this is easy to fix? John Log-Number: 30893 Date: Wed, 17 Apr 91 11:55:40 PDT From: tve (Thorsten von Eicken) Subject: ipserver on assault dead? I can't rlogin. Log-Number: 30896 From: Fred Douglis <douglis@cs.vu.nl> Subject: Re: ipserver on assault dead? I can't rlogin. Date: Wed, 17 Apr 91 21:00:54 +0200 Earlier today, I noticed that several machines seemed to have fallen into a state where I could not rlogin to them but I could telnet in. This is weird, and suggests some weirdness with inetd. Fred Log-Number: 30894 Subject: /sprite/src/kernel/libc/symm.md is broken Date: Wed, 17 Apr 91 11:57:24 PDT From: Mike Kupfer <kupfer> If you do "ls" or "echo *", you get Assertion failed: (dp->d-_namlen <= 255) from the readdir package. "pwd" yields pwd: getwd: can't open .. mike Log-Number: 30895 Date: Wed, 17 Apr 91 11:58:35 PDT From: tve (Thorsten von Eicken) Subject: /sprite/spool/msgs: no such file or directory Log-Number: 30897 Subject: Re: /sprite/spool/msgs: no such file or directory Date: Wed, 17 Apr 91 12:33:26 PDT From: Mike Kupfer <kupfer> The nfsmount of /sprite/spool/msgs is failing: <11>Apr 17 12:28:28 syslog: nfsmount: Pfs_Open: "/sprite/spool/msgs" service failed, errno 26 whatever that means. I restarted the IP server, but rlogin still fails. I think somebody should just reboot assault, especially seeing as how it's not yet running the 1.089 kernel. 
mike Log-Number: 30898 Date: Wed, 17 Apr 91 12:42:45 PDT From: tve (Thorsten von Eicken) Subject: rlogind unhappy on assault assault-8# /sprite/daemons/rlogind Segmentation violation assault-9# Log-Number: 30900 Subject: disk error on assault Date: Wed, 17 Apr 91 14:12:20 PDT From: Mike Kupfer <kupfer> Just before lunch I noticed the following error message on assault's console: SCSI disk SII#0 Target LUN 0 error: media error - info bytes 0x0 0x2 0x94 0x58 File blk 74 phys blk 74188: 4/17/91 12:06:35 Sprite Host <25> File "rchip.cif" <3,24998> Write-back failed: DISK ERROR mike Log-Number: 30901 Subject: allspice crash Date: Wed, 17 Apr 91 14:23:31 PDT From: Mary Baker <mgbaker> Allspice crashed with a page fault in Fs_ReadLinkStub on line 1662 in fsSysCall.c where it accesses the user's buffer. For all the other user addresses it accesses, the procedure uses routines such as Proc_ByteCopy that call Vm_CopyIn and stuff so that page faults are okay. But this buffer, it just goes and touches with no assurances. This sounds like a problem to me, unless somebody can tell me otherwise. Mary Log-Number: 30902 Subject: mop error messages Date: Wed, 17 Apr 91 20:55:14 PDT From: Mike Kupfer <kupfer> I tried booting arson using mop and got a bunch of error messages (I forgot to write down the exact message). There are some corresponding error messages in violence's syslog, like: [Wed Apr 17 20:20:32 1991]: Out of sequence request 1 vs 2 from 7ddffab6:100061d4:1000c8d8:1000b238:40051c:09 [Wed Apr 17 20:20:32 1991]: Out of sequence request 2 vs 3 from 7ddffab6:100061d4:1000c8d8:1000b238:40051c:09 [Wed Apr 17 20:20:33 1991]: Out of sequence request 3 vs 4 from 7ddffab6:100061d4:1000c8d8:1000b238:40051c:09 (1) Is there some resynchronization mechanism in the mop protocol? (2) If I get errors like this should I just retry, or do I need to restart the mop server? (3) Is there some reason why the mop server is on violence instead of assault? 
mike Log-Number: 30903 Date: Thu, 18 Apr 91 09:50:36 PDT From: tve (Thorsten von Eicken) Subject: unfsd I tried on shallot: mount assault:/graphics /mnt. The mount takes 1-2 minutes, a following df takes about 5 minutes... Is there hope to fix it? TvE Log-Number: 30904 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Thu, 18 Apr 1991 11:15:41 PDT Subject: pmake installall broken I thought that "pmake installall" would do all machines that it is appropriate to install for on the host, i.e. sun3, sun4, sun4c on a sun4c, and ds3100, ds5000 on a ds3100. Instead, if I do it on a sun4c the first thing it does is try to compile for a ds5000, which obviously isn't going to work. John Log-Number: 30908 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Fri, 19 Apr 1991 12:56:40 PDT Subject: weird filesystem (lfs?) problem Jaywalk had a disk at target 0, with an LFS on it. During the boot the kernel opens up this disk to see if it is the root disk. Since it isn't, it detaches it and continues the boot. Once the machine came up I used the "prefix" command to attach the disk under a temporary prefix. Somehow the kernel ended up thinking that this filesystem was remote, so that the block cleaner would try to write the blocks by doing an RPC to jaywalk itself. I'm not sure how this could happen. This bug shouldn't be a problem until we switch more of our filesystems over to LFS. Jaywalk was running kernel 1.090. The disk was one of the new micropolis disks. John Log-Number: 30909 Date: Sat, 20 Apr 91 13:48:57 PDT From: s244al@stat.Berkeley.EDU (Michial Gunter) Subject: Getting to know I initially sent the below message concerning g++ to root. It would be nice if new accounts were configured with a README file including such things as where to send mail in the event of different problems/requests. I don't know what is planned for Sprite. (Though, if y'all have the time I wouldn't mind knowing.)
Knowing this, as well as having some knowledge of the degree of support given for various things would be helpful. To this end, it would be helpful if there were an introduction to Sprite online (perhaps ``man sprite'' should produce something other than No manual entry for "sprite".) An introduction with references to other documents would be helpful. A document detailing how Sprite differs from (improves upon!) other versions of Unix would be particularly useful. The documentation for cc needs to be updated: -mtm Compile code for the target machine given by tm. If this switch is not given, then the default is to compile for the machine given by the MACHINE environment variable. The following machine types are currently defined: 68000, 68010, and sun2 (all of which compile for the 68000 instruction set); 68020 and sun3 (both of which compile for the 68020 instruction set); and spur. See below for additional -m switches to control other machine-dependent features. There is no manual entry for g++. I am willing to do the things I am requesting. I will become available to do so after May 18 (the date of my last final.) I should be able to contribute some time throughout the summer (my obligations now include only a couple of math classes.) Here is my initial message: > Date: Mon, 15 Apr 91 17:21:15 PDT > From: gunter (Michial Gunter) > To: root > Subject: g++ > > I am new to Sprite and don't really know how much effort is > made to support various things. > > In particular, is there any support for g++?
> If so: > > g++ -g -Wall -msun4 -c ch_error.cc > g++: installation problem, cannot exec g++1.sparc: no such file or directory > > and > > g++ -g -Wall -mds3100 -c main.cc > /sprite/lib/include/g++/stdarg.h:39: /sprite/lib/include/ds3100.md/stdarg.h: no > such file or directory > > > thanks, > mike and Mike Kupfer's reply > Return-Path: <kupfer@allspice.Berkeley.EDU> > To: gunter@allspice.Berkeley.EDU > Cc: spriters@allspice.Berkeley.EDU > Subject: Re: g++ > In-Reply-To: Your message of Mon, 15 Apr 91 17:21:15 -0700 > Date: Tue, 16 Apr 91 11:35:45 PDT > From: Mike Kupfer <kupfer@allspice.Berkeley.EDU> > > Have you gotten a response yet about your g++ questions? I don't use > g++, so I don't know what is causing the problems you're having. > However, I do know that people have done work using C++ on Sprite. If > you're still having problems, we should be able to track them down. > > Also, please submit future bug reports to "bugs", not root. This > guarantees that the bug will be discussed at the weekly Sprite > meeting. > > thanks, > mike > -- I have gotten no further response. thank you very much, mike gunter@sprite Log-Number: 30910 Date: Sat, 20 Apr 91 14:17:29 PDT From: elm (ethan miller) Subject: what's the story with /r1? raid1 is up (at least it responds to pings), but I can't access /r1. Is something being done about this? Is the condition permanent (ie, should I look elsewhere for the storage space I need)? ethan Log-Number: 30911 Subject: anise having troubles? Date: Sat, 20 Apr 91 15:22:45 PDT From: Mary Baker <mgbaker> Anise seems to be in an infinite cleaning loop, if that's possible. It's repeatedly saying: /user5: Cleaning started - 7 segs Can't fetch handle for file 28341 for cleaning (7 more of these messages for different file numbers) /user5: Cleaned 0 segments in 0 segments I'm not sure what's the right thing to do about this. 
Mary Log-Number: 30912 Subject: more about anise Date: Sat, 20 Apr 91 15:48:23 PDT From: Mary Baker <mgbaker> While anise was having its problems, I noticed that the daily dumps were still running on allspice and that it was still trying to dump /user5 on anise. I rebooted anise. So far it has not restarted the cleaning problem, and allspice was able to finish dumping /user5. Mary Log-Number: 30913 Subject: mmap man page Date: Sat, 20 Apr 91 16:21:00 PDT From: Mary Baker <mgbaker> Mark Sullivan reports that the mmap() man page is incorrect. It says that mmap returns 0 if successful and -1 if not. It actually returns the address of the mmap'd region. I looked at the man page and it has a warning saying that it may not be correct. Is there a good reason why it shouldn't be made correct? Mary Log-Number: 30914 Subject: panic on sched_MutexPtr Date: Sat, 20 Apr 91 16:36:09 PDT From: Mary Baker <mgbaker> Jaywalk died while trying to put a process into the debug state. It panic'd on a MASTER_LOCK of the scheduler mutex in Sync_SlowBroadcast. This was about the same time that Mark executed a "killdebug" which successfully killed 2 processes and then the machine died. The process that was trying to go onto the debug list was "make." I'm sorry I don't have a stack trace, but I was debugging from home and my connection got all messed up. Mary Log-Number: 30915 Date: Sat, 20 Apr 91 17:40:22 PDT From: sullivan (Mark Sullivan) Subject: dgram socket bug I have a routine for reading a packet from a socket. It uses the recvfrom() system call with the "peek" flag on to read a fixed size packet hdr. The packet hdr contains a length field. The length field is used to determine how many bytes to read in a second recvfrom() call with the "peek" flag off. In two programs that use this same routine for datagram sockets, the socket seems to go into permanent peek mode after the first peek. One packet is sent to the socket. The program reads that one packet as described above. 
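The two-step read described in this report can be sketched as below. This is a minimal illustration, not Mark's library routine: the header layout (`PktHdr`, with a host-byte-order `length` that counts the header itself) and the name `ReadPacket` are assumptions made for the example. On a system where the scheme behaves as intended, the second `recvfrom()` removes the datagram from the socket.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical fixed-size packet header; the real layout is not given. */
typedef struct PktHdr {
    uint32_t length;    /* total packet length in bytes, header included */
} PktHdr;

/*
 * Read one datagram with a peek followed by a real read.
 * Returns the number of bytes consumed, or -1 on error.
 * Note: on a bad header the datagram is left queued on the socket.
 */
static ssize_t
ReadPacket(int fd, void *buf, size_t bufSize)
{
    PktHdr hdr;
    ssize_t n;

    assert(buf != NULL);

    /* Step 1: peek at the header only; the datagram stays queued. */
    n = recvfrom(fd, &hdr, sizeof(hdr), MSG_PEEK, NULL, NULL);
    if (n < (ssize_t) sizeof(hdr)) {
        return -1;
    }
    if (hdr.length < sizeof(hdr) || hdr.length > bufSize) {
        return -1;
    }

    /* Step 2: read with the peek flag off; this should dequeue it. */
    return recvfrom(fd, buf, hdr.length, 0, NULL, NULL);
}
```

A quick check for the symptom is to call `ReadPacket()` once after a single send and then confirm that a non-blocking receive on the same socket finds nothing left.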
The program makes a select() call. It finds another copy of the packet at the socket. Reads it in again. Continues like this indefinitely. I am certain that the sender only sends one packet. The bug is repeatable and occurred in two different programs using the same library routine. The program works correctly on ultrix. The bug went away on Sprite when I removed the first recvfrom() and simply read the entire packet in at once. Mark Log-Number: 30916 Subject: jaywalk crash again Date: Sat, 20 Apr 91 17:49:47 PDT From: Mary Baker <mgbaker> I rebooted jaywalk with the "new" 1.091 kernel. Mark was using it and it crashed again. A process was trying to go onto the debug list and got hit with an interrupt while context switching. So it panic'd on the MASTER_LOCK of sched_MutexPtr in Sched_GatherProcessInfo. The PC of the holder of the lock was in Sched_LockAndSwitch. This is ungood. #0 panic (__builtin_va_alist=-167089713) (sysPrintf.c line 220) #1 0xf60a6d58 in Sched_GatherProcessInfo (...) (...) #2 0xf60b5058 in Timer_CallBack (...) (...) #3 0xf60b620c in Timer_TimerServiceInterrupt (...) (...) #4 0xf600fba8 in MachHandleInterrupt () #5 0xf600c7ac in Mach_ContextSwitch2 () #6 0xf60a71a8 in Sched_ContextSwitchInt (...) (...) #7 0xf60a7f68 in Sched_ContextSwitch (...) (...) #8 0xf608529c in Proc_SuspendProcess (...) (...) #9 0xf60aaf54 in Sig_Handle (...) (...) #10 0xf600ea7c in MachUserAction (...) (...) #11 0xf6010978 in MachReturnFromTrap () #12 0x3e734 in ?? () #13 0x3ff24 in ?? () #14 0x25214 in ?? () #15 0x24618 in ?? () #16 0x63c0 in ?? () #17 0x12404 in ?? () #18 0x16780 in ?? () #19 0x5c98c in ?? () Mary Log-Number: 30918 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Sun, 21 Apr 1991 22:01:25 PDT Subject: allspice crash Allspice crashed last evening after I booted the new 1.091 kernel. It died in OfsBlockFree, trying to free a bogus block number (-421800374). 
If I may complain about the sun4 debugger for a moment it wouldn't let me print out the value of "cylinderNum" in that routine, complaining that there was 'No symbol "cylinderNum" in current context.'. Also, in the calling routine Ofs_FileTrunc I was unable to print out the value of the block number because the blockAddrPtr of the indexInfo structure was a bogus value, although it must have worked when the kernel was running since it didn't die there. Maybe this thing is a pointer into the block cache? Right now allspice is running an old kernel. John Log-Number: 30919 Subject: uniq doesn't handle long lines well Date: Mon, 22 Apr 91 12:12:52 PDT From: Mike Kupfer <kupfer> I put uniq into an infinite loop processing some ID database references. I suspect the problem is that some of the lines were longer than the buffers in uniq, and uniq doesn't check for buffer overflow. There is a version of uniq on okeeffe that doesn't have this bug (and it seems to be non-AT&T code, as well). mike Log-Number: 30920 Date: Mon, 22 Apr 91 14:31:55 PDT From: dlong (Dean Long) Subject: DevTtyInit in devTtyAttach.c DevTtyInit() fails to get the console type on sun4c's with newer PROMs. I suggest the following patch: *** ../../../1.089/dev/sun4c.md/devTtyAttach.c Fri Oct 5 18:11:20 1990 --- devTtyAttach.c Sat Jan 19 17:59:02 1991 *************** *** 132,138 **** --- 132,142 ---- */ #ifndef sun2 + #ifdef sun4c + promConsoleType = *romVectorPtr->inSource; + #else promConsoleType = ((struct eeprom *) EEPROM_BASE)->ee_diag.eed_console; + #endif /* sun4c */ switch (promConsoleType) { case EED_CONS_TTYA: consoleUnit = 1; Log-Number: 30921 Subject: problem with X cmds location change? Date: Mon, 22 Apr 91 16:24:54 PDT From: Mary Baker <mgbaker> Buzz, a color sun3, can no longer run X since we've moved the cmds out of cmds.new and to their proper place. Can anyone think of any changes or mess-ups that might have occurred? Buzz used to run X okay on 1.089 a couple of weeks ago. 
I rebooted it with 1.089 and it no longer works. Mary Log-Number: 30925 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Tue, 23 Apr 1991 11:37:10 PDT Subject: allspice problems You may have noticed that allspice has been having some problems lately. Somehow a file handle gets locked, and everything piles up on it. This happened when allspice was running the 1.084 kernel, so I don't think it is something in the new kernel. The problem last night was caused by a process that had locked a file handle, and was waiting for a LFS checkpoint to complete: Sync_Wait(&lfsPtr->checkPointWait, FALSE). I was unable to determine why the checkPointWait condition was wedged. John Log-Number: 30926 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Tue, 23 Apr 1991 13:23:02 PDT Subject: more on allspice problems The problems with allspice seem to be limited to /sprite/src/kernel. I'm not sure why, but my guess is it has something to do with LFS since it is the only LFS disk on allspice. We are trying to come up with a plan of action, but in the meantime we are going to let allspice limp along. John Log-Number: 30927 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Tue, 23 Apr 1991 22:01:36 PDT Subject: lfs error Anise died because it had "read from a clean segment". My guess is that a segment was marked clean that really wasn't, and that this is bad. It happened right after a reboot after a crash so perhaps things got kind of screwed up on the disk. It looked continuable so I continued it. It then died with "LfsSetSegUsage called on a clean segment". I suppose this is related to the first bug. I continued this one too. More news as it happens. John Log-Number: 30928 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Tue, 23 Apr 1991 22:08:33 PDT Subject: anise crash Anise wedged up earlier this evening. The problem was a file handle that was left locked by an RPC server. I think fixing this bug should be a real high priority task. 
I might be able to look at it starting Thursday afternoon, but don't let that stop anyone else from volunteering. John Log-Number: 30930 Date: Wed, 24 Apr 91 15:19:00 PDT From: tve (Thorsten von Eicken) Subject: nfsmount performance I did a few tests (wouldn't call them benchmarks). The test is reading a 1.2 Mb file once (/sprite/src/benchmarks/read). The results I got are as follows (client/server): crackle/anise: 400Kb/s, 5Mb/s (cached). crackle/woosh: 118Kb/s (fresh nfsmount), 15Kb/s (worn-out nfsmount), crackle/ginger: 40Kb/s, 5Kb/s (loaded assault???) shallot/woosh: 200Kb/s, 6Mb/s (cached) shallot/ginger: 71Kb/s, 5Mb/s (cached) I conclude that the large read performance of nfsmount can be roughly 1/2 that of nfs, but that currently it gets trashed because of no caching and because it seems to wear-out, i.e. the process size grows and it gets slower. I'm also astonished that woosh, a sun386i responds faster than ginger. TvE NB: dunno if this should have gone to spriters rather than bugs... Log-Number: 30931 Date: Thu, 25 Apr 91 00:03:41 PDT From: shirriff (Ken Shirriff) Subject: /dev/tty Two problems: First, bogus /dev/tty files sometimes show up and cause problems. e.g. -rw-rw-r-- 1 ouster 521 Apr 19 09:28 /dev/tty Second, for Unix compatibility, we should probably have /dev/tty. Ken Log-Number: 30933 Date: Thu, 25 Apr 91 10:37:45 PDT From: jhh (John H. Hartman) Subject: lfs bug The lfs disk /user5 somehow got a corrupted checkpoint. I assume it was due to a crash during the checkpoint operation. Attaching the disk would cause the kernel to crash. We tried to run lfscheck on the disk, but this died also. We then ran lfsrebuild. This printed lots of error and warning messages, but did manage to complete. We then reattached the disk and it looked ok, but the following error message appeared on the syslog: LfsOkToRead read over segment boundary. Since this is a printf and not a panic I assume that it isn't critical. 
We then detached the disk and ran lfsrebuild again, just to see what it would do, and this time it died. So, I'm not sure what state /user5 is in, but I don't know what else to do except to attach it and see what happens. Just an LFS novice, John Log-Number: 30934 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Thu, 25 Apr 1991 11:42:19 PDT Subject: unknown file system problem I got the following message concerning a new rz57 disk attached to a ds5000, with an LFS on it. Obviously any one of these three could be at fault. John > From jclee Thu Apr 25 02:56:35 1991 > Received: by sprite.Berkeley.EDU (5.59/1.29) > id AA863306; Thu, 25 Apr 91 02:56:28 PDT > Date: Thu, 25 Apr 91 02:56:28 PDT > From: jclee (James C. Lee) > Message-Id: <9104250956.AA863306@sprite.Berkeley.EDU> > To: jhh > Subject: bug on LFS > > John, > > I think I've discovered a bug on your file system. Somehow I was > experiencing delays of ~10 seconds for "ls" in /scratch5/jclee/traces > earlier. The problem was not on the entire file system, but rather only > that directory. I "cd"ed to /scratch5/jclee and "ls" works at normal > speed. I also tried "ls" in /scratch5/jclee/traces from different > machines just to make sure that it's not machine-dependent, and indeed > it's not machine-dependent. > > I should've left the directory untouched for you to see, but I wanted to > get my project going, and so I tinkered around with it. I "mv" all the > files in /scratch5/jclee/traces to another directory and then "mv" them > back, and everything's back to normal. > > I don't know the details of LFS, but it seems that it may be that info > on directories was scattered around the tracks? I was creating multiple > *huge* files when the symptom first appeared. Since I heard that LFS > writes where the head is, this might cause fragmented directory info? > > Anyway, sorry that the bug is not repeatable. Hopefully the description > helps.... > > James > Log-Number: 30935 From: jhh@sprite.Berkeley.EDU (John H.
Hartman) Date: Thu, 25 Apr 1991 11:43:27 PDT Subject: fscheck.c truncated The file /sprite/src/cmds/fscheck/fscheck.c was truncated. I restored it via RCS, and moved the bad copy to fscheck.c.bad. John Log-Number: 30939 Subject: RPC sanity check failure on murder Date: Thu, 25 Apr 91 18:07:33 PDT From: Mike Kupfer <kupfer> Murder died (while groveling through a dump tape :-( ) with RpcSanityCheck: packet too short, 98 < 126594 Rpc_SanityCheck: client -10816301, server -2145348087: Fatal Error: Sanity check failed on outgoing packet. It didn't respond to "kmsg -v", so I rebooted it. mike Log-Number: 30940 Date: Thu, 25 Apr 91 21:00:50 PDT From: root (The Sprite God) Subject: tar.gnu in debugger Tar.gnu went into the debugger while we were trying to restore /user5. The problem is that extract() has this structure called hstat which it passes on the stack to SpriteMakePseudoDev, but SpriteMakePseudoDev thinks it got a pointer to hstat. I'll try to fix it and start up the restore again. If it's not one thing, it's another... Mary (alias root while I've got no home directory) Log-Number: 30942 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Thu, 25 Apr 1991 23:06:31 PDT Subject: rename bug The cause of anise wedging up is a bug in rename. The intent of the rename code is that the lookup loop produces a locked handle on the target. If the target exists it is deleted. Then a handle is locked on the source, and the link is made. What happened is that the lookup loop ended up with a handle for the source, thus the second locking attempt deadlocked. I can't figure out how this could have happened. The code looks correct. The parameters to the GrabHandle routine were correct. I think it must be a race of some sort. I've identified the program that caused the problem. It is a daemon of mine that organizes news articles into threads of discussion. It uses a bunch of little files that it copies before it modifies them, then renames the new copy to the original name.
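For anyone building the small test environment, the daemon's update pattern as described here (copy, modify the copy, rename the copy over the original) might look roughly like the sketch below. It is an illustration only, not the daemon's code; the name `UpdateByRename` and the "append some text" modification are assumptions.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical sketch: copy 'path' to 'tmpPath', append 'extra' to the
 * copy, then rename the copy to the original name. The rename target
 * always exists, so every update exercises the "target exists" path
 * of rename. Returns 0 on success, -1 on failure.
 */
static int
UpdateByRename(const char *path, const char *tmpPath, const char *extra)
{
    FILE *in, *out;
    int c;
    int ok = 1;

    assert(path != NULL && tmpPath != NULL && extra != NULL);

    in = fopen(path, "r");
    if (in == NULL) {
        return -1;
    }
    out = fopen(tmpPath, "w");
    if (out == NULL) {
        fclose(in);
        return -1;
    }

    /* Copy the original byte for byte. */
    while ((c = getc(in)) != EOF) {
        if (putc(c, out) == EOF) {
            ok = 0;
            break;
        }
    }
    if (ferror(in)) {
        ok = 0;
    }
    /* Modify the copy, never the original. */
    if (ok && fputs(extra, out) == EOF) {
        ok = 0;
    }
    fclose(in);
    if (fclose(out) == EOF) {
        ok = 0;
    }

    /* Replace the original with the finished copy. */
    if (!ok || rename(tmpPath, path) != 0) {
        remove(tmpPath);
        return -1;
    }
    return 0;
}
```

Running two processes that loop over the same set of small files with a routine like this would reproduce the two-daemon situation, assuming that is in fact a factor.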
The last time anise wedged up there were two daemons running, when there should have been only one. Perhaps this is a factor. We need to set up a small test environment to track this down. Obviously I've stopped running the daemon until the bug is found. John Log-Number: 30943 Date: Fri, 26 Apr 91 08:40:52 PDT From: bmiller (Bob Miller) Subject: anonymous ftp From jds@cs.UMD.EDU Fri Apr 26 08:21:32 1991 To: sprite-request@sprite.Berkeley.EDU Cc: jds@cs.UMD.EDU Subject: Sprite papers via anonymous FTP? Date: Fri, 26 Apr 91 11:20:20 -0400 From: James da Silva <jds@cs.UMD.EDU> Greetings, I'm interested in browsing some of your Sprite papers. I had heard that at least the Log-structured File System USENIX paper was available via anonymous FTP from sprite.berkeley.edu. However, when I try it: [darling 7] ftp sprite.berkeley.edu Connected to allspice.Berkeley.EDU. 220 allspice.Berkeley.EDU FTP server (Version 4.3 Wed Jul 11 23:01:30 PDT 1990) ready. Name (sprite.berkeley.edu:jds): anonymous 331 Guest login ok; supply userid as password. Password: 530 User ftp: can't change directory to /users/ftp. Login failed. Looks like the permissions are wrong? If this is on purpose, or if the papers aren't available via FTP, do you have a Sprite bibliography you could send my way? Thanks for your time, Jaime ........................................................................... : domain: jds@cs.umd.edu James da Silva : path: uunet!mimsy!jds Systems Design & Analysis Group Log-Number: 30944 Date: Fri, 26 Apr 91 08:53:56 PDT From: ouster (John Ousterhout) Subject: FTP area still unavailable John's message from yesterday implied that the FTP area had been moved off of /user5. However, ~ftp still doesn't exist so anonymous ftp doesn't work. Is this just a matter of changing the symbolic link at /users/ftp, or is something more complicated needed? -John- Log-Number: 30945 From: jhh@sprite.Berkeley.EDU (John H. 
Hartman) Date: Fri, 26 Apr 1991 10:38:36 PDT Subject: sun4 timer stops Anise stopped getting timer interrupts, so it stopped processing its timer queue. It's amazing how long the machine will keep running after this happens (it was still up 90 minutes after the timer stopped). I don't know why the timer stopped. It looks like the timer chip just stopped producing interrupts. I was unable to access the timer chip control registers from the debugger so I'm not sure what it thought it was doing. I tried to reset it from the debugger but that didn't work either. John Log-Number: 30946 Date: Fri, 26 Apr 91 13:07:20 PDT From: root (The Sprite God) Subject: Infinite recovery loop bug pretty much figured out Yesterday evening, when John and I were in a particularly bad mood, we took out our aggressions on the infinite recovery loop bug. This is because it had decided to pop up on anise and allspice and several other machines in the middle of our trying to deal with the failures of /user5, /pcs, and a bunch of other things. It is an example of what John meant by those little bugs that get put aside and then get in the way when you're really in trouble. It popped up in such a way, though, that we had complete syslog information on multiple machines about what was going on. Usually one side of the mess is in a locked room someplace. Here's the scoop: When a prefix broadcast is done to determine what server is exporting a particular file system, only the machine exporting that prefix responds to the broadcast. All other machines give no response. If, though, a client knows that a particular server exported a prefix, then the next time the client does a prefix rpc, it does not do a broadcast. Instead, it sends just that prefix rpc to that server. If the server of that prefix has had that disk removed (/user5 on anise) or something else happens to make a prefix it was exporting no longer present, then that server doesn't respond to the prefix RPC sent to it.
The client (in this case allspice) sees its prefix RPC to anise time out. It marks anise as dead, puts it in its recovery list, and cleans up the state it was keeping for anise (removes anise from its clientList). Then, some program (cron or such) on anise tries to open some file on allspice and anise sends allspice a handle for the file. Allspice says "but you're not in my clientList for that handle. You must have some bogus handle." Allspice returns "STALE HANDLE" to anise. Then, since allspice still has anise on its recovery list, it will attempt to go through recovery with anise. The first part of recovery is an attempt to reopen all the prefixes for the server. Allspice does a prefix RPC for /user5 to anise and this times out again and the whole thing repeats itself... The solution: There are a couple of angles to handle this from. We can change things so that if a server receives a prefix RPC that's not a broadcast, it will always return something, either success or "no handle." This means that a prefix RPC that is not a broadcast will never time out on a client. The client receiving a "no handle" response can remove that prefix/serverID pair from its prefix table. This prevents the client from repeatedly hassling the server about a prefix it's no longer serving, and the client is also free to broadcast for the handle in case that file system is now served by somebody else. But we have to be careful not to break the weird setup over in Cory Hall. Due to horrors in the past, the machines in Cory have been hardwired with prefix/serverID pairs to avoid prefix broadcasts. So hardwired prefix/serverID pairs shouldn't be removed automatically from prefix tables. The question remains as to what's the cleanest way for a server to determine that a prefix request was specifically sent to it. It's easy to tell at the low level of handling the RPC, but that's a disgusting place to put the fix.
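The server-side rule proposed above (answer every prefix RPC that is not a broadcast, stay silent only on broadcasts for prefixes not exported) can be sketched as follows. Every name here is hypothetical and none comes from the Sprite sources; in particular, how the server learns that a request was directed is modeled as a flag bit carried in the request, which is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag: set by a client that sent the request to one server. */
#define PREFIX_REQ_DIRECTED 0x1

typedef enum {
    PREFIX_NO_REPLY,    /* stay silent: broadcast for a prefix we lack */
    PREFIX_SUCCESS,     /* we export the prefix */
    PREFIX_NO_HANDLE    /* directed request for a prefix we don't export */
} PrefixReply;

typedef struct {
    const char *prefix; /* e.g. "/user5" */
    int flags;          /* PREFIX_REQ_DIRECTED or 0 */
} PrefixRequest;

/*
 * Decide how to answer a prefix request, given the list of prefixes
 * this server currently exports.
 */
static PrefixReply
PrefixService(const PrefixRequest *reqPtr, const char *const exported[],
        size_t numExported)
{
    size_t i;

    assert(reqPtr != NULL && reqPtr->prefix != NULL);
    for (i = 0; i < numExported; i++) {
        if (strcmp(exported[i], reqPtr->prefix) == 0) {
            return PREFIX_SUCCESS;
        }
    }
    /* A directed request always gets an answer, so it cannot time out. */
    if (reqPtr->flags & PREFIX_REQ_DIRECTED) {
        return PREFIX_NO_HANDLE;
    }
    return PREFIX_NO_REPLY;
}
```

On the client side, a `PREFIX_NO_HANDLE` answer would be the cue to drop the prefix/serverID pair, except for hardwired pairs such as the ones in Cory Hall.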
The right place is in the prefix service routine, but the rpc header information needed is lost by the time the prefix service routine is called. However, we can set a flag in the prefix request info sent from the client that will be seen by the server. I think this makes sense and will do it if there are no objections. Mary Log-Number: 30948 Date: Fri, 26 Apr 91 17:37:22 PDT From: tve (Thorsten von Eicken) Subject: small problem on /pcs Dump: can't lstat /pcs/tve/lib/santillana/mss/pn7y8.1/ID0127_PN8-27.poem: invalid argument Log-Number: 30949 Subject: missing LOCK_MONITOR in fsio Date: Fri, 26 Apr 91 17:58:42 PDT From: Mike Kupfer <kupfer> Boy, this is going to be fun. We should start a pool on how many locking bugs I'll discover in the next week. Fsio_StreamAddClient is missing a LOCK_MONITOR. By the way, Sync_Unlock doesn't verify that the lock is actually set before freeing it. Any objections to my adding this check? mike Log-Number: 30950 Subject: optimization settings Date: Fri, 26 Apr 91 18:22:50 PDT From: Mike Kupfer <kupfer> I noticed that I wasn't getting -Wall for the sync module. Further investigation leads to the following questions: (1) is it okay to have -g3 when optimization is turned off (for DECstations)? The current .mk files only use -g3 when optimization is turned on, which obviously works, but it leads to a somewhat crufty sequence in the .mk files. (2) is there some reason not to do the OFLAG and GFLAG assignments in tm.mk? They appear in bigcmd.mk, command.mk, kernel.mk, and library.mk, and the assignments aren't all the same. mike Log-Number: 30951 From: mgbaker (Mary Gray Baker) Subject: allspice crash Date: Fri, 26 Apr 91 19:26:42 PDT Allspice ran out of memory. That was pretty quick. We got a core, in case it can tell us anything. The restore of /user5 didn't finish before allspice crashed. Mary Log-Number: 30959 From: mgbaker (Mary Gray Baker) Subject: allspice crash Date: Sun, 28 Apr 91 19:33:21 PDT Allspice ran out of memory again.
It always seems to do this when I attempt to restore /user5. I got another core image which we may or may not be able to look at with any more success than the last one. Mary Log-Number: 30966 Date: Mon, 29 Apr 91 16:22:40 PDT From: ouster (John Ousterhout) Subject: Re: allspice crash Is it possible that the restore program is not properly closing files? Since we have no limit on the number of open files in Sprite, this might be causing Allspice to run out of memory. -John- Log-Number: 30952 From: mgbaker (Mary Gray Baker) Subject: Can't debug allspice core Date: Fri, 26 Apr 91 19:37:40 PDT Bringing the allspice core up in the debugger, it says panic (ptrace: I/O error. Cannot read memory: address 0x64 out of bounds. and cannot give me a stack trace. Mary Log-Number: 30957 Date: Sun, 28 Apr 91 15:43:45 PDT From: shirriff (Ken Shirriff) Subject: ipServer went into debugger The ipServer died when it tried to do a bad free. It was in TCP_SocketDestroy trying to free tcpPtr->templatePtr. Ken Log-Number: 30958 Date: Sun, 28 Apr 91 15:48:04 PDT From: shirriff (Ken Shirriff) Subject: tk include files messed up /usr/include/tk.h includes tkInt.h, which only exists in /sprite/src/lib/tk. Ken Log-Number: 30961 From: mgbaker (Mary Gray Baker) Subject: debugging out-of-memory crash Date: Sun, 28 Apr 91 20:14:41 PDT The reason we cannot debug out-of-memory crashes using kgcore seems to be an incompatibility between how kgcore lays out the different kernel segments (when one of them is too big) and what the debugger thinks the addresses are. This warrants further investigation. Mary Log-Number: 30962 Date: Mon, 29 Apr 91 00:18:37 PDT From: mottsmth (Jim Mott-Smith) Subject: Assault died Assault died at about 11:50pm with Fatal error: HandleRelease handle <1,25,0,59200> "cmds" not locked Syncing disks FslclLookup, missing '..' link: ID <25,0,44624> Ken claims responsibility. He was in a directory in one window and deleted the directory from another. 
-- Jim M-S (part-time ddj) Log-Number: 30964 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Mon, 29 Apr 1991 10:44:22 PDT Subject: Re: Assault died This is a known bug that has been around for quite a while. I remember looking at it, but deciding that it wasn't trivial to fix. I think we should add it to our spring cleaning list if someone doesn't fix it before then. John Log-Number: 30963 Date: Mon, 29 Apr 91 00:20:07 PDT From: mottsmth (Jim Mott-Smith) Subject: Disk space message When I try to initialize the tape in preparation for a weekly backup it says Dump: error writing to /sprite/admin/dump/dumplog: no more space in file system domain. a df show 19K blocks available. What's happening here? -- Jim M-S Log-Number: 30969 Date: Mon, 29 Apr 91 13:39:57 PDT From: shirriff@ginger.Berkeley.EDU (Ken Shirriff) Subject: Lfs crashed allspice Allspice died with bad LfsStableMemBlockHdr on /user6. I took a core and rebooted. We now have 3 core files saved; allspice seems to crash faster than we are examining cores. Log-Number: 30970 From: mgbaker (Mary Gray Baker) Subject: Re: Lfs crashed allspice Date: Mon, 29 Apr 91 18:26:26 PDT I am examining the lfs crash core file at this moment. I cannot find the list crash corefile and already sent mail asking where it is. The other core file is from yesterday, and I already looked at it. Mary Log-Number: 30971 From: mgbaker (Mary Gray Baker) Subject: allspice lfs crash Date: Mon, 29 Apr 91 19:18:36 PDT I took a look at the core for the lfs crash earlier today. It died in LfsStableMemFetch() at line 454. It was trying to do a lookup of /user5/kupfer. The prefixPtr has the serverID set to 14, namely allspice. In LfsStableMemFetch, where it died, the hdrPtr seems to point to garbage, in fact, it appears to be the error string, however blockPtr->blockAddr, from which it's set, has a different, also bad, address. I don't know what's going on here. There's probably something I'm not understanding about the debugger. 
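One way to read the `*hdrPtr` dump in the gdb session that follows is to treat the four header words as text. The helper below is a small sketch, not Sprite code (`DecodeWords` is a hypothetical name); it unpacks 32-bit words most significant byte first, which is the byte order on a sun4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: unpack 32-bit words into characters, most
 * significant byte first (big-endian). Non-printable bytes become '.'.
 * 'out' must have room for 4 * count + 1 bytes.
 */
static void
DecodeWords(const uint32_t *words, size_t count, char *out)
{
    size_t i;
    int shift;

    assert(words != NULL && out != NULL);
    for (i = 0; i < count; i++) {
        for (shift = 24; shift >= 0; shift -= 8) {
            unsigned char c = (unsigned char) ((words[i] >> shift) & 0xff);
            *out++ = (c >= 0x20 && c < 0x7f) ? (char) c : '.';
        }
    }
    *out = '\0';
}
```

Fed the magic, memType, blockNum and reserved values from the dump (0x42616420, 0x4c667353, 0x7461626c, 0x654d656d), this yields "Bad LfsStableMem", the first 16 bytes of the error string. That fits the observation that hdrPtr holds the string's address at the time of the dump, rather than the address of a block header.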
(gdb) p/x hdrPtr $25 = 0xf606b0d8 (gdb) x/s $25 0xf606b0d8 <LfsStableMemWriteDone+240>: (char *) 0xf606b0d8 "Bad LfsStableMemBlockHdr\n" (gdb) p/x blockPtr->blockAddr $29 = 0xf88e8000 (gdb) list 449 #ifdef ERROR_CHECK 450 hdrPtr = (LfsStableMemBlockHdr *) blockPtr->blockAddr; 451 if ((hdrPtr->magic != LFS_STABLE_MEM_BLOCK_MAGIC) || 452 (hdrPtr->memType != smemPtr->params.memType) || 453 (hdrPtr->blockNum != blockNum)) { 454 LfsError(smemPtr->lfsPtr, FAILURE, "Bad LfsStableMemBlockHdr\n"); 455 } 456 #endif /* ERROR_CHECK */ 457 entryPtr->addr = blockPtr->blockAddr + offset; 458 entryPtr->blockNum = blockNum; (gdb) p/x *hdrPtr $22 = { magic = 0x42616420, memType = 0x4c667353, blockNum = 0x7461626c, reserved = 0x654d656d } (gdb) where #0 panic (__builtin_va_alist=-167350692) (sysPrintf.c line 220) #1 0xf6066e9c in LfsError (...) (...) #2 0xf606b358 in LfsStableMemFetch (smemPtr=(struct LfsStableMem *) 0xf65ebeb8, entryNumber=636, flags=0, entryPtr=(struct LfsStableMemEntry *) 0xf80b3640) (lfsStableMem.c line 454) #3 0xf6062658 in LfsDescMapGetDiskAddr (lfsPtr=(struct Lfs *) 0xf65ebb80, fileNumber=54234, diskAddrPtr=(ClientData) 0xf80b36d4) (lfsDescMap.c line 150) #4 0xf6061d28 in Lfs_FileDescFetch (domainPtr=(struct Fsdm_Domain *) 0xf680e3b8, fileNumber=54234, fileDescPtr=(struct Fsdm_FileDescriptor *) 0xf6bddc68) (lfsDesc.c line 73) #5 0xf60387e8 in Fsdm_FileDescFetch (...) (...) #6 0xf60439ac in Fsio_LocalFileHandleInit (...) (...) 
#7 0xf604b3a0 in FindComponent (parentHandlePtr=(struct Fsio_FileIOHandle *) 0xf6bdc2a8, component=(char *) 0xf80b3af8 "kupfer", compLen=6, isDotDot=0, curHandlePtrPtr=(struct Fsio_FileIOHandle **) 0xf80b395c, dirOffsetPtr=(ClientData) 0xf80b3954) (fslclLookup.c line 869) #8 0xf604a67c in FslclLookup (prefixHdrPtr=(struct Fs_HandleHeader *) 0xf6bdc2a8, relativeName=(char *) 0xf6664fd8 "kupfer", rootIDPtr=(struct Fs_FileID *) 0xf6664be8, useFlags=0, type=0, clientID=72, idPtr=(struct Fs_UserIDs *) 0xf6664c0c, permissions=0, fileNumber=0, handlePtrPtr=(struct Fsio_FileIOHandle **) 0xf80b3c8c, newNameInfoPtrPtr=(struct Fs_RedirectInfo **) 0xf80b3d3c) (fslclLookup.c line 402) #9 0xf604989c in FslclGetAttrPath (prefixHandlePtr=(struct Fs_HandleHeader *) 0xf6bdc2a8, relativeName=(char *) 0xf6664fd8 "kupfer", argsPtr=(char *) 0xf6664bd8 , resultsPtr=(char *) 0xf80b3d48 "\366\265S8\366\265SH", newNameInfoPtrPtr=(struct Fs_RedirectInfo **) 0xf80b3d3c) (fslclDomain.c line 248) #10 0xf6056670 in Fsrmt_RpcGetAttrPath (srvToken=(ClientData) 0xf6663b68, clientID=-161068072, storagePtr=(struct Rpc_Storage *) 0xf80b3dc8) (fsrmtAttributes.c line 212) #11 0xf60ade88 in Rpc_Server (...) (...) #12 0xf60b3310 in Sched_StartKernProc (...) (...) Mary Log-Number: 31003 From: mendel (Mendel Rosenblum) Subject: Re: allspice lfs crash Date: Sat, 04 May 91 15:59:24 PDT > In LfsStableMemFetch, where it died, the hdrPtr seems to point to garbage, > in fact, it appears to be the error string, however blockPtr->blockAddr, > from which it's set, has a different, also bad, address. I don't know what's > going on here. There's probably something I'm not understanding about the > debugger. 
> > > > (gdb) p/x hdrPtr > $25 = 0xf606b0d8 > (gdb) x/s $25 > 0xf606b0d8 <LfsStableMemWriteDone+240>: (char *) 0xf606b0d8 "Bad LfsStableMemBlockHdr\n" > > (gdb) p/x blockPtr->blockAddr > $29 = 0xf88e8000 > > > (gdb) list > 449 #ifdef ERROR_CHECK > 450 hdrPtr = (LfsStableMemBlockHdr *) blockPtr->blockAddr; > 451 if ((hdrPtr->magic != LFS_STABLE_MEM_BLOCK_MAGIC) || > 452 (hdrPtr->memType != smemPtr->params.memType) || > 453 (hdrPtr->blockNum != blockNum)) { > 454 LfsError(smemPtr->lfsPtr, FAILURE, "Bad LfsStableMemBlockHdr\n"); > 455 } > 456 #endif /* ERROR_CHECK */ There are two problems with the debugger here. The first is since we compile with -O the value of hdrPtr is no longer available after its last use. The compiler put hdrPtr in $o2 and then trashed $o2 doing the call to LfsError. The second problem is the kgcore program doesn't transfer the file cache by default. (The -c will cause kgcore to dump the file cache but it will take too long on allspice.) This means the memory at blockPtr->blockAddr is not accessible to the debugger. Mendel Log-Number: 30973 Date: Mon, 29 Apr 91 21:28:17 PDT From: bsw!adam@uunet.UU.NET (Adam de Boor) Subject: allspice lfs crash I don't know how much optimization you folks turn on in gcc, but the thing with hdrPtr being the error message looks a lot like an optimization. gcc probably put hdrPtr in an o register since it's not used outside the ERROR_CHECK block. I think "info address hdrPtr" will tell you what register the thing's in, maybe? I've seen this many, many times when I use gcc... a Log-Number: 30972 From: mgbaker (Mary Gray Baker) Subject: More about allspice crash Date: Mon, 29 Apr 91 19:20:24 PDT I forgot to say that allspice was running the 1.084 kernel when it got the lfs crash. I don't know why it had been rebooted with that kernel rather than "new." 
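A side note on the hdrPtr puzzle discussed above (Log-Numbers 30971, 31003, 30973): the four words gdb printed for *hdrPtr are themselves printable ASCII when read most significant byte first, which fits the explanation that the register once holding hdrPtr had been reused for the address of the error string. What follows is only a minimal sketch of that check. The helper name WordsToAscii is made up for illustration and is not part of the Sprite sources; the only data taken from the log are the four hex values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not from the Sprite sources): unpack 32-bit words,
 * most significant byte first, into a NUL-terminated string.  The caller
 * must supply a buffer of at least 4 * count + 1 bytes.
 */
static void
WordsToAscii(const uint32_t *words, size_t count, char *out, size_t outSize)
{
    size_t i;

    assert(outSize >= 4 * count + 1);
    for (i = 0; i < count; i++) {
        out[4 * i + 0] = (char) ((words[i] >> 24) & 0xff);
        out[4 * i + 1] = (char) ((words[i] >> 16) & 0xff);
        out[4 * i + 2] = (char) ((words[i] >> 8) & 0xff);
        out[4 * i + 3] = (char) (words[i] & 0xff);
    }
    out[4 * count] = '\0';
}

/* The field values gdb printed for *hdrPtr in Log-Number 30971. */
static const uint32_t hdrWords[4] = {
    0x42616420,  /* magic */
    0x4c667353,  /* memType */
    0x7461626c,  /* blockNum */
    0x654d656d   /* reserved */
};
```

Passing hdrWords through WordsToAscii with a 17-byte buffer gives "Bad LfsStableMem", the first 16 bytes of the string handed to LfsError, so the "header" gdb displayed was the message text rather than anything read from the block.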
Mary Log-Number: 30974 Date: Tue, 30 Apr 91 09:54:29 PDT From: mottsmth (Jim Mott-Smith) Subject: Allspice died with "Bad LfsStableMemBlockHdr" (There, JHH, I put the crash cause in the subject line). The console said the following: Warning SCSI3#3 DMA bus error DevRawBlockDevRead: error 0x0 inlength 512 at offset 0x102ed600 outlength 0 Fsdm_FileDescFetch found junky file desc Fsio_LocalFileHandleInit: Fsdm_FileDescGetch of 147472 failed 0x1 FindComponent , no handle <0x1> for "admin" filenumber 147472 Fatal Error: LfsError: on /sprite/src/kernel status 0x1, Bad LfsStableMemBlockHdr JHH hypothesized that the driver code doesn't handle bus resets very well. Not understanding that the device was reset, it went ahead and read garbage, which poor LFS then tried to dereference. No core dump was taken. -- Jim M-S Log-Number: 30975 Date: Tue, 30 Apr 91 12:27:51 PDT From: shirriff (Ken Shirriff) Subject: lpr locks up sage Twice when I've tried to print a file, sage (the machine attached to the printer) has locked up. (Sorry Mike!) Ken Log-Number: 30976 Date: Tue, 30 Apr 91 12:55:13 PDT From: shirriff (Ken Shirriff) Subject: lpr shutting off timer The problem with sage was apparently that the printer cable was loose. The connector on the cable is missing the tab to make it snap into place securely. We should either get the cable repaired or make a note of the problem for next time. The effect of the loose cable was that the timer was stopping. This is interesting since John encountered timer problems with something else. Ken Log-Number: 30978 From: mgbaker (Mary Gray Baker) Subject: tar.gnu arguments Date: Tue, 30 Apr 91 17:57:50 PDT This is probably supposed to be standard knowledge, but what are the differences between tar and tar.gnu? There are some mystery arguments to tar.gnu that aren't arguments to tar, and there's no man page for tar.gnu. Some of the arguments I can see easily in the code, but others are vague. 
Mary Log-Number: 30979 Date: Wed, 1 May 91 12:45:22 PDT From: ouster (John Ousterhout) Subject: Trashed mail At some point during the various Sprite crashes last weekend my mail spool file inherited two bogus "messages". One consists of a piece of syslog output, I think: tocol 90 Warning: Sock_ReturnError: bad protocol 90 Warning: Sock_ReturnError: bad protocol 90 Warning: Sock_ReturnError: bad protocol 90 Warning: Sock_ReturnError: bad protocol 90 Warning: Sock_ReturnError: bad protocol 90 Warning: Sock_ReturnError: bad protocol 90 Warning: Sock_ReturnError: bad protocol 90 Warning: Sock_ReturnError: bad protocol 90 Warning: Sock_ReturnError: bad protocol 90 Warning: Sock_ReturnError: bad protocol 90 Warning: Sock_ReturnError: bad protocol 90 Warning: Sock_ReturnError: bad protocol 90 Warning: Sock_ReturnError: From guarino@src.dec.com Mon Apr 29 10:07:41 1991 and the other is apparently a bunch of nulls. This can probably be explained by Sprite's "sync on a client doesn't sync through to disk" behavior. I thought of a way to verify that nothing has been lost after such an occurrence: look at the mail spool file, count the number of bytes in the nulls or inherited garbage, and see if that number corresponds to the exact size of a later message. This would occur if a crash occurs in the middle of receiving a message, leaving garbage in the spool file, but the message is retransmitted later. Unfortunately I didn't think of this in time to try it on my mailbox. -John- Log-Number: 30980 Subject: raid1 stuck -> reboot Date: Wed, 01 May 91 14:12:51 PDT From: Mike Kupfer <kupfer> raid1 wouldn't let me log in, even on the console. L1-p showed a bunch of RPC servers all waiting on the same event, and L1-z showed that most of them were reopens. I suspect that some process locked /r1 and then hung, so that eventually the world ground to a halt. I rebooted raid1. mike Log-Number: 30981 From: jhh@sprite.Berkeley.EDU (John H. 
Hartman) Date: Wed, 1 May 1991 17:15:12 PDT Subject: strlen on invalid prefix Loiter crashed because it tried to do a strlen on the prefixPtr->prefix inside of FsprefixLookupRedirect. Unfortunately either prefixPtr was pointing to the wrong place, or its contents were garbage. From looking at the list of prefixes it looks like the pointer was bad. It was pointing at something that looked to me to be an IP packet (ARP broadcast from csgw2). It occurred to me that perhaps we got an interrupt at a bad time and didn't restore a register. An examination of the code didn't turn up anything obvious. Another possibility is that the prefixPtr was never set to anything. It is not initialized to NIL, so perhaps it was used uninitialized and just happened to end up pointing to an ethernet packet. I looked through the code for GetPrefix, but didn't see any path that could cause this to happen. The machine was a ds5000 running kernel DS5000.JHH.2110. John Log-Number: 30982 From: mgbaker (Mary Gray Baker) Subject: prefixPtr sparcstation crash & debugger question Date: Wed, 01 May 91 20:08:16 PDT Jaywalk crashed today while indirecting through a prefixPtr in FsprefixLookupRedirect line 608. In the calling routine, the prefixPtr passed to FsprefixLookupRedirect is valid, but the structure is zeroed out. However, the crash was just dereferencing the pointer. It was doing a lookup as part of an open of ../../lib/include/sun4.md/sys.h. 
Part of what I don't understand is the stack: (gdb) where #0 panic (__builtin_va_alist=-167711443) (sysPrintf.c line 220) #1 0xf600f1c8 in MachHandleTrap (trapType=112, pcValue=(char *) 0xf6050e60 "\344\004 \b\240\004\340\004@", trapPsr=4194501) (sun4c.md/machCode.c line 1557) #2 0xf601093c in MachReturnFromTrap () #3 0xf6054aa8 in FsrmtOpen (prefixHandle=(struct Fs_HandleHeader *) 0xf814391c, relativeName=(char *) 0xf8143a58 "../Include/sun4c.md/user/sys.h", argsPtr=(char *) 0xf6568c08 , resultsPtr=(char *) 0xf8143938 "\366 E\330", newNameInfoPtrPtr=(struct Fs_RedirectInfo **) 0xf8143894) (fsrmtDomain.c line 338) #4 0xf6050584 in Fsprefix_LookupOperation (fileName=(char *) 0xf8143a58 "../Include/sun4c.md/user/sys.h", operation=2, follow=4096, argsPtr=(char *) 0xf8143970 , resultsPtr=(char *) 0xf8143938 "\366 E\330", nameInfoPtr=(struct Fs_NameInfo *) 0xf64da250) (fsprefixOps.c line 210) #5 0xf602e2ac in Fs_Open (...) (...) #6 0xf602ffe4 in Fs_OpenStub (...) (...) #7 0xf60114b0 in MachFetchArgsEnd () Why is FsrmtOpen on the stack in frame 3? It was called earlier from frame 4, but then frame 4 called FsprefixLookupRedirect which is where the pc was when the machine crashed. I would expect to see FsprefixLookupRedirect instead of FsrmtOpen as the routine in frame 3. Is the return address for a previous call not being overwritten by the next one? This would seem to mean the register windows aren't being flushed or aren't being flushed in the right place. If someone has a better explanation, please tell me. Mary Log-Number: 30983 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Wed, 1 May 1991 20:40:07 PDT Subject: Re: prefixPtr sparcstation crash & debugger question This bug appears similar to loiter's crash that I reported earlier (strlen). My guess is that deleting a prefix doesn't work properly in some cases. Last night as part of the change to the new disk I deleted the /sprite/src/kernel prefix from all machines. 
Both Jaywalk and Loiter crashed while doing a LookupRedirect in which the current directory was /sprite/src/kernel and the pathname was relative and started with "..". Also, in both cases I think a file open was the cause of the lookup. john Log-Number: 30984 From: mgbaker (Mary Gray Baker) Subject: tar fooling me Date: Wed, 01 May 91 20:32:25 PDT I tried to dump our traces to a tape using tar. I did the following tar cvhf /dev/exb1.nr /traces/allspice /traces/anise /traces/assault >&! out And it put the names of all the trace files in out, so it seemed like it was doing the right thing. Then, just to check it was all there, I rewound the tape and did a tar tvf /dev/exb1.nr >&! out There were no files on the tape and out was empty. What did I do wrong? Mary Log-Number: 30985 Subject: lock holderPCBPtr field Date: Thu, 02 May 91 12:01:12 PDT From: Mike Kupfer <kupfer> Does anyone know why the holderPCBPtr field in a lock or master lock is defined as an Address? Why isn't it a (Proc_ControlBlock *)? mike Log-Number: 30986 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Thu, 2 May 1991 12:04:03 PDT Subject: Re: lock holderPCBPtr field Once again it is probably due to a circularity in the header files. Perhaps this circularity is now gone, so that it could be redefined. Any time you see something defined as an Address when it obviously should have another type it is probably due to past problems with header files. John Log-Number: 30987 Subject: Re: problem with msgs Date: Thu, 02 May 91 12:40:48 PDT From: Mike Kupfer <kupfer> > Date: Thu, 2 May 91 12:08:50 PDT > From: eklee (Edward K. Lee) > To: spriters > Subject: problem with msgs > > I only get the headers without the body. > > Ed Well, here are the clues. "msgs" works fine on suns. It also works fine on assault, which is the server that nfsmounts the msgs partition. However, it doesn't work on any of the half-dozen or so DECstations that I tried (either 3100s or 5000s). 
If I "more" the msgs files on a DECstation, I see that the first 76 characters of the file are missing. This screws up the headers, which is probably confusing "msgs". If I "cat" the files, they look fine. Hypothesis: "more" and "msgs" do an ioctl that "cat" does not, and somebody is mishandling the ioctl, causing the first 76 characters of the file to get dropped on the floor. Does this ring any bells with any of the Spriters...? mike Log-Number: 30988 Date: Thu, 2 May 91 12:43:29 PDT From: shirriff (Ken Shirriff) Subject: Re: problem with msgs We've had problems before with lseeks failing with nfs files due to a byte swapping file. I bet that "more" reads the first 76 characters to make sure it's not an object file and then lseeks to the beginning to reread the file. Ken Log-Number: 30989 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Thu, 2 May 1991 12:44:41 PDT Subject: Re: problem with msgs This is a bug in nfsmount. Try doing "more" on a decstation of any file accessed via nfsmount. You end up with the first bunch of characters missing. I think it is because more does an lseek back to the beginning of the file, and nfsmount isn't resetting the offset correctly or some such thing. Didn't somebody fix this already? John Log-Number: 30990 Subject: deadlock when remote exec fails Date: Thu, 02 May 91 14:52:49 PDT From: Mike Kupfer <kupfer> A process migrated from nutmeg to catnip. It was supposed to do a remote exec, but that failed with SYS_ARG_NOACCESS. When exiting it tried to lock its pcb, which deadlocked because it had locked the pcb in Proc_ResumeMigProc. mike -- (gdb) bt #0 0xe004132 in Mach_ContextSwitch () #1 0xfeedbabe in ?? 
() #2 0xe081cde in SyncEventWaitInt (event=237605708, wakeIfSignal=0) (syncLock.c line 655) #3 0xe08123e in Sync_SlowWait ( conditionPtr=(struct Sync_Condition *) 0xe29934c, lockPtr=(struct Sync_KernelLock *) 0xe0c92c4, wakeIfSignal=0) (syncLock.c line 298) #4 0xe071e02 in Proc_Lock ( procPtr=(struct Proc_ControlBlock *) 0xe2992e0) (procTable.c line 416) #5 0xe0682ec in ProcExitProcess ( exitProcPtr=(struct Proc_ControlBlock *) 0xe2992e0, reason=4, status=5, code=0, thisProcess=1) (procExit.c line 538) #6 0xe067dba in Proc_ExitInt (reason=4, status=5, code=0) (procExit.c line 270) #7 0xe067ae6 in ProcDoRemoteExec ( procPtr=(struct Proc_ControlBlock *) 0xe2992e0) (procExec.c line 1878) #8 0xe06ef58 in Proc_ResumeMigProc (pc=106756) (procRemote.c line 313) Log-Number: 30991 Date: Thu, 2 May 91 17:04:21 PDT From: kupfer (Mike Kupfer) Subject: changing the dump scripts Somebody has been editing the dailydump and weeklydumps in /sprite/admin.sun4, even though the sources are RCS'd and living in /sprite/src/admin/{daily,weekly}dump. I think I put the scripts under RCS after some anonymous person broke the scripts and I had to fix them by hand. I would prefer that we continue to keep the scripts under RCS. If we want to move the RCS directory to /sprite/admin.sun4, I won't complain too loudly, though it's inconsistent with the usual Sprite file organization. mike Log-Number: 30993 Subject: Re: problem with X11 from parsley Date: Thu, 02 May 91 21:35:19 PDT From: Mike Kupfer <kupfer> Do you know if the Mac actually does an rlogin, or does it try to telnet in? It's not immediately obvious to me what the problem might be, so I'm forwarding your message to "bugs" so that it gets discussed at the weekly Sprite meeting. mike -- Date: Thu, 2 May 91 17:38:45 PDT >From: randy (Randy Katz) To: sprite Subject: problem with X11 from parsley I have succeeded in making MacX work from my macintosh to mercenary without difficulty. 
However, when I attempt to connect to mayhem, I get some problems about password incorrect while establishing the link (parsley, my MAC, attempts to rlogin to mayhem to issue the appropriate xterm command). It seems that xterm works fine when initiated from mayhem. Some time ago, because I had problems with telnet, parsley was deleted from allspice's network id file (I think). could this be the root of the password problem? By the way, mercenary is a Sun SPARCSTATION running SUN OS mayhem is a DS5000 running sprite I attempted this same thing with ginger, but something faulted, either the mac or in ginger (something about a 68881 fault). randy Log-Number: 30994 From: mgbaker (Mary Gray Baker) Subject: allspice server mutex deadlock Date: Thu, 02 May 91 21:56:02 PDT Allspice died tonight with a dead lock. It tried a bunch of times to go into the debugger, but didn't make it. Proc: serverMutex @0xf61e2628 Holder pc: 0xf60945f8 Current pc: 0xf605459c Holder PCB: 0xf662fc68 Current PCB: 0xf662fc68 Mary Log-Number: 30995 Date: Fri, 3 May 91 00:42:05 PDT From: root (The Sprite God) Subject: Assault died with repeated TLB load error Assault went crazy and started printing MachKernelExceptionHandler: Address error on load. addr 17 pc 800a2ba0 Entering debugger with a TLB load addr error exception at PC 0x800a2ba0 endlessly on the console. It wouldn't go into the debugger so I rebooted it. --Jim M-S Log-Number: 30997 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Fri, 3 May 1991 09:49:02 PDT Subject: Re: Assault died with repeated TLB load error This has happened before. There is a bug in the debugger, such that you get an address error trying to parse a packet or something. That sends you back to the debugger, causing the loop. Assault was running 1.079. This kernel has been deleted from /sprite/src/kernel/sprite, so I couldn't see where the pc was. Also, the address is always 17. 
John Log-Number: 30996 Date: Fri, 3 May 91 09:13:24 PDT From: randy (Randy Katz) Subject: two messages concatenated in Sprite mail From mike@postgres.Berkeley.EDU Thu May 2 19:01:37 1991 Date: Thu, 2 May 91 19:00:47 -0700 From: mike@postgres.Berkeley.EDU (Mike Stonebraker) To: randy@sprite.Berkeley.EDU Subject: try this -- references are at end or from DEC proposal [bulk of message deleted -mdk 5/3/91] This requires indexing a region of spatial data and a region of time accoFrom kupfer Thu May 2 21:35:33 1991 Received: by sprite.Berkeley.EDU (5.59/1.29) id AA532794; Thu, 2 May 91 21:35:20 PDT Message-Id: <9105030435.AA532794@sprite.Berkeley.EDU> To: randy Cc: bugs Subject: Re: problem with X11 from parsley In-Reply-To: Your message of Thu, 02 May 91 17:38:45 -0700 Date: Thu, 02 May 91 21:35:19 PDT From: Mike Kupfer <kupfer> [bulk of message deleted -mdk 5/3/91] Log-Number: 30998 Date: Fri, 3 May 91 17:37:34 PDT From: randy (Randy Katz) Subject: rlogin SUNOS to Sprite I can rlogin from sprite to SunOS, but for some reason I can't go the other way. On mayhem, a sprite d5000, I go over to mercenary, a sun os sparcstation. when I try to rlogin back to mayhem, the connection times out. I have mercenary in the .rhosts file on mayhem too. randy Log-Number: 30999 Date: Fri, 3 May 91 18:30:35 PDT From: shirriff (Ken Shirriff) Subject: allspice deadlock Allspice and raid1 got in a deadlock, so I rebooted raid1, which cleared it up. I'll take a look at the cores to figure out what happened. Ken Log-Number: 31000 Date: Fri, 3 May 91 23:02:27 PDT From: eklee (Edward K. Lee) Subject: Re: allspice deadlock >>Allspice and raid1 got in a deadlock, so I rebooted raid1, which cleared >>it up. I'll take a look at the cores to figure out what happened. >>Ken I may have provoked this deadlock. A process on raid1 was waiting for a callback from a block IO procedure with at least one file handle locked. 
To save time I wanted to see if I could continue it instead of rebooting raid1 so I diddled the data structures to return a failed status and then manually executed Sync_MasterBroadcast from the debugger. The result was not what I wanted so I was about to reboot raid1 when RPCs to allspice started to time out. I went home at this point, thinking that allspice had crashed. Ed Log-Number: 31001 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Fri, 3 May 1991 23:15:09 PDT Subject: anise crash, handle not locked Anise crashed because a handle wasn't locked. I took a core. I'll look at it tomorrow if someone else doesn't first. John Log-Number: 31002 From: mendel (Mendel Rosenblum) Subject: Allspice, anise, assault crash Date: Sat, 04 May 91 12:58:35 PDT When I came in this morning allspice was hung and anise and assault were in the debugger. I could not login to allspice because it was out of processes. The problem is that when assault dies allspice fills its process table with sendmail processes waiting for assault. Assault died because it ran out of memory. I decided to just reboot assault with the hope that this would unwedge allspice so I could debug anise which was in the debugger because it tried to unlock a file handle that was not locked. More on anise later. Assault couldn't reboot because allspice wasn't answering its requests for "/". I tried to type L1-i to see what was wrong and the L1-i code seg faulted and put allspice in the debugger. I took a core dump of allspice into /home/ginger/pnh/cores/vmcore.allspice.crash.l1i if the author of the L1-i code is interested. The problem with anise appears to be a shell on sedition that was sitting in a deleted directory tree. Each time sedition went thru recovery it tried to open this file and crashed anise. Is this a known bug? 
Mendel Log-Number: 31004 From: mgbaker (Mary Gray Baker) Subject: pc from last Proc: serverMutex allspice crash Date: Sun, 05 May 91 15:05:24 PDT I forgot to mail out what the pc's were from the message a couple of days ago about the proc serverMutex deadlock. The holder pc was 0xf60945f8 which claims to be line 151 in procDebug.c: status = ProcGetNextDebug(...); The current pc was 0xf609459c which claims to be line 144 in the same place: the switch on different requests (PROC_GET_THIS_DEBUG, etc). I don't think this looks right. Mary Log-Number: 31005 Subject: gcc on DECstations lacks the rcsid recognizer? Date: Mon, 06 May 91 16:24:04 PDT From: Mike Kupfer <kupfer> If I build mach/sun3.md/machEeprom.c on a DECstation, I get a complaint about the RCS id being defined but not used. I don't get this problem if I build it on a Sun. I assume that the problem is simply that gcc on the DECstations didn't get updated to recognize RCS ids. mike Log-Number: 31007 Date: Tue, 7 May 91 14:12:25 PDT From: elm (ethan miller) Subject: another mail/filesystem problem Once again, a mail message was munged. Mail from me to margo@postgres had about 600 lines of eklee's spool file appended. His spool file seems to be OK now. This happened around an allspice crash. My machine is a sun4c; I'm not sure of anything about Ed's machine. ethan Log-Number: 31008 Date: Tue, 7 May 91 13:09:52 PDT From: shirriff@ginger.Berkeley.EDU (Ken Shirriff) Subject: Allspice crash Allspice crashed due to an ioctl being done on /dev/raid instead of /hosts/raid/... So this was a user error, not a sprite bug. Ken Log-Number: 31009 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Tue, 7 May 1991 15:30:24 PDT Subject: Re: Allspice crash I disagree that this isn't a Sprite bug. I don't think an ioctl should ever bring the machine down. John Log-Number: 31010 Date: Tue, 7 May 91 15:53:38 PDT From: eklee (Edward K. Lee) Subject: Re: Allspice crash >>I disagree that this isn't a Sprite bug. 
I don't think an ioctl should >>ever bring the machine down. >>John Right you are John. I've put in checks in the kernel to make sure the same thing doesn't happen again. Ed Log-Number: 31023 Date: Fri, 10 May 91 08:58:42 PDT From: ouster@ginger.Berkeley.EDU (John Ousterhout) Subject: Allspice crash Allspice died shortly after I came in this morning. The error message was: Kernel page fault at illegal pc: 0xf6032644, addr 0x4973a I took a core dump for the ddj, which is on ginger in the file /home/ginger/pnh/cores/vmcore.allspice.May10. I renamed the existing vmcore file (about a week old) to vmcore.allspice.whoKnows or something like that. There are quite a few core files in that directory now. Would it make sense to delete some of the older ones? -John- Log-Number: 31011 Date: Tue, 7 May 91 19:22:48 PDT From: dlong (Dean Long) Subject: rdate rdate needs to be relinked. The 8/23/89 version does not allow blank lines in /etc/spritehosts, even though the current C library (Next_Host.c) allows blank lines. dl Log-Number: 31012 Subject: sun4 ld broken on DECstations Date: Wed, 08 May 91 16:58:03 PDT From: Mike Kupfer <kupfer> The ld for the sun4 that runs on the DECstations seems to be broken. I think that's what caused the problems I had with the sun4 kernel Monday night. I suspect that it's the -r problem that we had last year, and that the fixed version was never installed on the DECstations. mike Log-Number: 31013 Subject: trashed libc source file Date: Wed, 08 May 91 20:18:58 PDT From: Mike Kupfer <kupfer> I discovered that /sprite/src/lib/c/gnulib/sun3.md/_divdf3.s had gotten trashed sometime recently. I restored it from the 10 April full dumps and renamed the trashed version to _divdf3.s.bad, in case anyone wants to take a look at it. mike Log-Number: 31014 Date: Thu, 9 May 91 08:54:46 PDT From: root (The Sprite God) Subject: anise is down. Is someone taking care? Thanks. 
TvE (also: the ipserver on allspice is dead) Log-Number: 31015 Subject: proc macro arguments not in parentheses Date: Thu, 09 May 91 11:38:47 PDT From: Mike Kupfer <kupfer> Many of the macros in the user and kernel proc.h do not put parentheses around their arguments. Thus we have #define Proc_ComparePIDs(p1, p2) (p1 == p2) when we really want #define Proc_ComparePIDs(p1, p2) ((p1) == (p2)) mike Log-Number: 31017 Date: Thu, 9 May 91 13:14:08 PDT From: shirriff (Ken Shirriff) Subject: xproof can't find fonts About 5 days ago xproof quit working for me on the ds3100. When I try to run it, it says: "Unable to load any useable ISO8859-1 fonts." Any ideas? Ken Log-Number: 31020 Subject: Re: xproof can't find fonts Date: Thu, 09 May 91 14:25:26 PDT From: Mike Kupfer <kupfer> It works for me, at least to preview man pages. What are you trying to view? One possibility is that your X server has somehow gotten confused and should be restarted. mike Log-Number: 31021 Subject: "ld" status Date: Thu, 09 May 91 16:21:23 PDT From: Mike Kupfer <kupfer> The ld in /sprite/cmds.sun4 is apparently from /sprite/src/old/cmds/ld.old. It does the work itself. The ld in cmds.{sun3,symm} and the DECstation gld are front-end programs that invoke a machine-specific program in /sprite/lib/gcc to do the real work. Current problems: 1. The sun4 ld in /sprite/lib/gcc/ds3100.md is broken, at least when using the -r option. (The version in sun3.md seems to work, though.) I've replaced it with a shell script, so that "gld -msun4" on a DECstation will fail with a useful error message. 2. The new ld sources (/sprite/src/cmds/ld.$MACHINE) are in something of a chaotic state, particularly the sun4 version. Furthermore, the new ld, at least for the sun4, is broken worse than the ld I just disabled. 
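Returning to the proc.h macro report above (Log-Number 31015): a small sketch of why the missing parentheses matter. The two macro names below are hypothetical stand-ins, not the real proc.h definitions; they only mirror the two forms quoted in the report, and the masked-PID comparison is an invented use chosen to trigger the problem. With an argument such as pid & mask, the bare form expands to (pid & mask == expected), and since == binds tighter than &, mask is compared with expected first.

```c
/* Hypothetical stand-ins mirroring the two forms quoted in the report. */
#define COMPARE_PIDS_BARE(p1, p2)   (p1 == p2)
#define COMPARE_PIDS_PAREN(p1, p2)  ((p1) == (p2))

/*
 * Compare the low bits of a process ID against an expected value using
 * each macro form.  The argument "pid & mask" is an expression, which is
 * where the two forms diverge.
 */
static int
LowBitsMatchBare(int pid, int mask, int expected)
{
    /* Expands to (pid & mask == expected), i.e. pid & (mask == expected). */
    return COMPARE_PIDS_BARE(pid & mask, expected);
}

static int
LowBitsMatchParen(int pid, int mask, int expected)
{
    /* Expands to ((pid & mask) == (expected)), as intended. */
    return COMPARE_PIDS_PAREN(pid & mask, expected);
}
```

For pid 0x1234, mask 0xff and expected 0x34, the bare form evaluates 0x1234 & (0xff == 0x34), which is 0, while the parenthesized form reports the match. A compiler that warns about suspicious precedence may flag the bare expansion, which is another hint that the arguments need parentheses.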
It seems to me like we should either (a) fix up the "new" ld (the one that uses a front end) or (b) move the old ld out of /sprite/src/old and back to /sprite/src/cmds, and fix the two different versions of ld so that "make" will do the right thing depending on the target machine type. mike Log-Number: 31022 From: mendel (Mendel Rosenblum) Subject: LFS checkpoint corruption bug fixed Date: Thu, 09 May 91 20:15:37 PDT I fixed the bug that was causing LFS checkpoints to be corrupted. This is the bug that killed /user5 and caused problems for /pcs. The problem should not affect the LFS file systems on allspice. Only the older LFS file systems, of which /pcs is the only one left, are in danger. Anise should run a kernel made from the uninstalled lfs module (such as sun4.md/mendel) until the lfs module is installed. Mendel Log-Number: 31024 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Fri, 10 May 1991 10:31:48 PDT Subject: copy bug in Fs_ReadLinkStub The routine Fs_ReadLinkStub copies stuff directly into user space. If the buffer pointer is bogus the machine will crash, as did allspice this morning. I think this bug has already been reported by Mary, since it appears that she put in a fix in the uninstalled sources. The fix involves calling Proc_ByteCopy. John Log-Number: 31026 From: mgbaker (Mary Gray Baker) Subject: Re: copy bug in Fs_ReadLinkStub Date: Fri, 10 May 91 11:23:38 PDT This bug has indeed been fixed. But it's in the uninstalled module still. Mary Log-Number: 31027 Date: Fri, 10 May 91 11:49:19 -0700 From: slater@ucbarpa.Berkeley.EDU (Mel Slater) Subject: screen damage I use X11 a lot on sprite, and it is getting more difficult to use, because now every few seconds there is an "assault -recovery" message which damages the display. I don't know if this is already known, so thought I'd better report it. Log-Number: 31028 From: jhh@sprite.Berkeley.EDU (John H. 
Hartman) Date: Fri, 10 May 1991 11:59:16 PDT Subject: Re: screen damage You need to cat the syslog device /dev/syslog into a window. Otherwise it goes to the console which screws up your screen. I use the following line in my .xsetup file: tx -title /dev/syslog =80x9-0+0 ${DISPLAY} -e cat /dev/syslog Give it a try. John Log-Number: 31029 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Fri, 10 May 1991 12:29:55 PDT Subject: update semantics The update command has kind of funny semantics. If I say "update /foo1 /foo2 /bar" it will make /bar/foo1 and /bar/foo2. If I say "update /foo1 /bar" it will take the contents of /foo1 and put them in /bar. I can't figure out any way to get it to make /bar/foo1. Thus the behavior of update is dependent on the number of command line options. It seems to me that the latter behavior should be specified via a command-line option. On the other hand, the current behavior of "update /foo1 /bar" is similar for files and directories, which is why it was probably done this way in the first place. Any comments? I hate to add the option because lots of scripts will probably break. John Log-Number: 31031 Date: Fri, 10 May 91 13:17:04 PDT From: ouster (John Ousterhout) Subject: Re: update semantics In the distant past we had many arguments about how to do the arguments for update, and we eventually settled on the current scheme as the best among many imperfect alternatives. I'd vote against changing it without first doing a very thorough analysis of the alternatives and their failure modes. John's current problem can be solved with the command "update /foo1 bar/foo1", I think. -John- Log-Number: 31033 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Fri, 10 May 1991 17:02:19 PDT Subject: migration questions > From jclee Fri May 10 16:10:56 1991 > Received: by sprite.Berkeley.EDU (5.59/1.29) > id AA670786; Fri, 10 May 91 16:10:54 PDT > Date: Fri, 10 May 91 16:10:54 PDT > From: jclee (James C. 
Lee)
> Message-Id: <9105102310.AA670786@sprite.Berkeley.EDU>
> To: jhh
> Subject: migration question
>
> John,
>
> I'm using pmake to run 4 simulation processes in background priority:
>
> pmake -b -R -L 1 -f sim.s0.p1.b32.id1k.m10.t1
>
> I get the following message:
>
> JobFlagForMigration: warning: eviction of process 63c0c apparently did not complete.
>
> And I look at the process in question and it's still running on the machine
> it migrated to. Running "rup" indicates that the machine it migrated to is
> "inuse." 2 questions:
>
> 1. Apparently the process didn't get evicted automatically; is this a bug?
>
> 2. If the process stays on the "migrated" system, with the above pmake
> flag (-b), would it affect the owner of the system? Right now one of my
> processes is on loiter--I think it's your machine. Do you notice any
> performance degradation? And if so, is there a flag I can specify to
> make sure that the processes remigrate in a nice fashion, automagically?
>
> Thanks.
>
> James
>

Log-Number: 31041
From: Fred Douglis <douglis@cs.vu.nl>
Subject: migration questions
Date: Sun, 12 May 91 13:08:14 +0200

The "don't migrate" flag is set when a process is a pdev master, or
(perhaps?) when it shares its heap with another process. Once set it's
never cleared. There's clearly a bug lurking around if the migration
daemon attempts to migrate an unmigratable foreign process. As for
processes being unmigratable in the first place, it's just a matter of
implementation, right? :-)

Another possibility, as a work-around, would be to evict foreign
processes when they do something that would make them unmigratable, and
then pin them to their home machine rather than a foreign machine. Of
course, without the hooks to notify another process that the eviction
has occurred, you could wind up with some load-balancing problems.

Fred

Log-Number: 31034
From: jhh@sprite.Berkeley.EDU (John H.
Hartman)
Date: Fri, 10 May 1991 17:16:00 PDT
Subject: more on migration questions

Jclee is running simulations using pmake with the "-b" (background)
flag. As his mail indicated there might be a problem here. Currently one
of his jobs is running on my machine, and I'm definitely using the
machine. His process cannot be evicted and appears to be a local job
even though it isn't.

John

Log-Number: 31035
Subject: Re: more on migration questions
Date: Fri, 10 May 91 17:29:22 PDT
From: Mike Kupfer <kupfer>

I noticed last week or the week before a migrated job on arson that had
the "don't migrate" flag set, so it couldn't migrate home. I couldn't
figure out how the flag had gotten set, and the process eventually died
when I put arson into the debugger to figure out what was going on.

One thing to note is that once the "don't migrate" flag is set, it is
never cleared.

mike

Log-Number: 31036
Date: Sat, 11 May 91 09:47:55 PDT
From: gibson (Garth Gibson)
Subject: nulls in my mail file

I was about to delete my (infrequently read) Sprite mail file. It
reports:

forgery 87> mail
Warning: encountered nulls at 1760368. Mail spool file may be damaged.
"/sprite/spool/mail/gibson": 1803 messages 898 new 1729 unread
& x

If you care to examine, feel free. When you get back to me, I'll delete
this file.

ALSO, please take my name off all sprite aliases except spriteusers,
raid, spurretro, and xprs

thanks
garth

Log-Number: 31037
Date: Sat, 11 May 91 15:00:11 PDT
From: eklee (Edward K. Lee)
Subject: ds5000 binary compatibility problem

We're trying to run the workview binaries for ds5000's on Sprite and
have run into a problem. We are in a hurry and would appreciate it if
you could look into this soon. Do the following to duplicate the
problem:

--------------
forgery% su eklee
forgery% cd /r3/raid/viewlogic/raidII/xbus
forgery% check -s xbuspullup
CHECK - V3.25; Workview 4.0a, 8000 Series
Copyright (c) 1990 by Viewlogic Systems, Inc.
Unable to generate directory ./sch/
Unable to generate directory ./sym/
Unable to generate directory ./wir/
Unable to generate directory ./sch/log/
Unable to generate directory ./sym/log/
Unable to generate directory ./sch/bac/
Unable to generate directory ./sym/bac/
forgery%
--------------

Note that the directories it tried to create already exist. You should
also get a bunch of error messages in the syslog:

--------------
MachUNIXGetDirEntries: Bad directory format
MachUNIXGetDirEntries: Bad directory format
MachUNIXGetDirEntries: Bad directory format
MachUNIXGetDirEntries: Bad directory format
MachUNIXGetDirEntries: Bad directory format
MachUNIXGetDirEntries: Bad directory format
--------------

Near line 1679 of machUNIXSyscall.c, which prints the above syslog
message, there is a rather odd comparison:

--------------
if (dirPtr->nameLength <= FS_MAX_NAME_LENGTH) {
    printf("MachUNIXGetDirEntries: Bad directory format\n");
}
--------------

At first I thought the comparison was at fault, but it doesn't do
anything but print a message.

Ed

Log-Number: 31039
Date: Sat, 11 May 91 18:21:10 PDT
From: eklee (Edward K. Lee)
Subject: getdirentries

getdirentries currently just performs Fs_Read and returns the result.
This doesn't work across machines with different byte orders.

Ed

P.S. This does not solve the binary compatibility problem I reported
earlier.

Log-Number: 31040
Date: Sat, 11 May 91 22:41:08 PDT
From: eklee (Edward K. Lee)
Subject: X binary compatibility problems on ds5000

It works on ds3100's but not on ds5000's. I tried running workview on
forgery and displaying on basil but that also did not work. I can run
the unix version of workview on basil and display on forgery, however.

Ed

----------
forgery% su eklee
forgery% workview
<...>
Error: Can't open X display: forgery.Berkeley.EDU:0.0
Workview 8000 - V4.0a
forgery%

Log-Number: 31042
From: jhh@sprite.Berkeley.EDU (John H.
Hartman)
Date: Sun, 12 May 1991 10:06:08 PDT
Subject: decstation binary compatibility

Here is a followup message from Ed about the binary compatibility
problems on the decstations.

> From eklee Sat May 11 17:54:39 1991
> Received: by sprite.Berkeley.EDU (5.59/1.29)
> id AA397354; Sat, 11 May 91 17:54:37 PDT
> Date: Sat, 11 May 91 17:54:37 PDT
> From: eklee (Edward K. Lee)
> Message-Id: <9105120054.AA397354@sprite.Berkeley.EDU>
> To: jhh@sprite.Berkeley.EDU
> Subject: Re: ds5000 binary compatibility problem
>
> The same problem occurs on ds3100.
> It appears to be a byteswapping problem.
> From a decstation, the getdirentries system call works on file systems exported
> by decstations but not those exported by sparcstations or sun4's.
>
> Ed
>

Log-Number: 31043
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Sun, 12 May 1991 17:44:47 PDT
Subject: bug w/ interaction of fscheck, lfs

I've fixed a bug in fscheck that caused it to crash if the 'a' partition
of a disk contained an LFS. For some reason fscheck would read the label
off the 'a' partition. Because the 'a' partition contained an LFS it
wouldn't be able to find the domain header and it would crash. The real
bug is that it shouldn't use the 'a' partition but should use whatever
partition it is checking. I didn't change this when I rewrote fscheck
because I didn't know why it was done. I still don't know why it is done
but I got rid of it anyway. This means that the copies of the disk label
in each partition must all be identical. The old fscheck might have been
more robust if someone changed only the first label and not the copies,
but if this happens I'm sure that other things will break anyway.
Besides, fscheck is on its way to retirement.

John

Log-Number: 31045
Date: Sun, 12 May 91 18:56:32 PDT
From: mendel (Mendel Rosenblum)
Subject: Re: Lfs killed allspice

> Allspice died with:
> SCSI #3 DMA bus error
> Lfs error on /sprite/src/kernel status 0x1 bad lfsStable MemBlockHdr
> I took a core: vmcore.lfs

The problem here is not in LFS but in the SCSI HBA hardware or driver.
The HBA is aborting the LFS read operation with a DMA bus error. This
appears to happen when the system is doing a lot of I/O, such as during
an fscheck of the disk. It appears that LFS can also trigger the
condition. We need to either fix this or put in a patch to retry the
operation that gets aborted.

Mendel

Log-Number: 31046
Date: Sun, 12 May 91 19:07:10 PDT
From: eklee (Edward K. Lee)
Subject: Re: kdbx on ds5000 fixed

>>From jhh Sun May 12 18:34:39 1991
>>From: jhh@sprite.Berkeley.EDU (John H. Hartman)
>>Date: Sun, 12 May 1991 18:34:38 PDT
>>X-Mailer: Mail User's Shell (7.1.1 5/02/90)
>>To: sprite
>>Subject: kdbx on ds5000 fixed

>>I fixed the problem where you couldn't run kdbx on a ds5000. There
>>was an ifdef ds3100 in bootcmds that I overlooked.

>>As an aside, I'm wondering whether we should add an environment
>>variable named OS that would be set to the type of operating system
>>that the machine should emulate. Having it would clean up lots
>>of ifdefs, and would allow people to have $OS in their paths once
>>we start using binary versions of commands. It might even make
>>sense to have it understood by the filesystem as is MACHINE. Would
>>this make it easier to do the binary compatibility?

>>John

This was the same bug that prevented workview from opening the X
display. You can ignore my earlier bug report. John, what exactly was
ifdef'ed out?

Ed

Log-Number: 31047
Date: Sun, 12 May 91 20:38:14 PDT
From: shirriff (Ken Shirriff)
Subject: Allspice had Fsio_FileCloseInt problem

Allspice timed out for a couple minutes and then did consistency stuff
and came back.
It had the mysterious message:

Fsio_FileCloseInt: almost returned FS_FILE_REMOVED w/handle locked

on the console.

Log-Number: 31049
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Sun, 12 May 1991 20:53:07 PDT
Subject: Re: Allspice had Fsio_FileCloseInt problem

I put this message in when I fixed the bug whereby Proc_ServerProcs were
leaving handles locked. I wanted to verify that the race did indeed
exist. Any time the message is printed it means that the old kernel
would have wedged. There is a comment about all of this in the code, but
I suppose it is time to remove it anyway since the bug seems to be
fixed.

John

Log-Number: 31050
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Mon, 13 May 1991 10:03:57 PDT
Subject: handle not locked bug

Anise died due to the handle not locked bug. I think this one is
increasing in frequency and will soon become a big problem.

John

Log-Number: 31051
Subject: infinite loop crons
Date: Mon, 13 May 91 12:18:54 PDT
From: Mike Kupfer <kupfer>

When I came in today, there were a couple crons running in an infinite
loop on sage. "kill -DEBUG" had no effect on them, so I nuked them
("kill -9"). Shortly after that I noticed a couple messages in sage's
syslog, which may or may not be relevant:

Warning: VmSwapFileRemove: Fs_Remove(/swap/33/137) returned 4000c. Reopening swap directory.
Warning: VmSwapFileRemove: Fs_Remove(/swap/33/29) returned 4000c. Reopening swap directory.

mike

Log-Number: 31052
From: tve (Thorsten von Eicken)
Subject: Re: infinite loop crons
Date: Mon, 13 May 91 12:20:18 PDT

I had one on crackle too.

TvE

Log-Number: 31053
Subject: allspice hung doing consistency on spritehosts
Date: Mon, 13 May 91 15:20:09 PDT
From: Mike Kupfer <kupfer>

I rebooted allspice around 1300 (1pm) because several clients had stuck
RPCs and I couldn't log in on the console (as root).

Investigating the core dump shows that the hung RPCs were all waiting
for someone to finish a consistency check on /etc/spritehosts.
(My notes from doing L1-p show that a couple user processes on allspice
were also waiting on the same thing.)

From what I can tell from the code, there is a one-minute timeout on
consistency calls. I don't remember how long I waited before putting
allspice into the debugger, but it was at least long enough to walk
between the machine room and 608-2 a couple times. What happens when the
call times out? Or, more precisely, what happens to future calls that
reference the file that's being checked (/etc/spritehosts in this case)?
Is there a one-minute timeout for each RPC?

There were messages on allspice's console saying that it was waiting for
a hung RPC to arson; there were also messages in arson's syslog saying
that it was waiting for allspice. I don't know whether there was really
a distributed deadlock or whether the system had simply become
temporarily constipated.

If anyone has suggestions for additional things to look for in the core
dump, please send me mail.

mike

Log-Number: 31054
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Mon, 13 May 1991 15:33:44 PDT
Subject: Re: allspice hung doing consistency on spritehosts

There is a one-minute timeout per RPC. Note that this is actually one
minute between the server doing a consist RPC to the client, and when
the client does a consist reply rpc to the server. Thus it can take
quite a while if there are a lot of clients to timeout on.

John

Log-Number: 31055
Date: Wed, 15 May 91 12:01:35 PDT
From: bmiller (Bob Miller)
Subject: "./xgone" message

Am I the only one getting this message in my console window?

"open of "./xgone" wating for recovery"

What's causing this?????

Log-Number: 31056
Subject: Re: /graphics
Date: Wed, 15 May 91 14:08:30 PDT
From: Mike Kupfer <kupfer>

> Date: Tue, 14 May 91 02:44:51 -0700
> From: root@bezier.Berkeley.EDU (System PRIVILEGED Account)
> To: root@sprite.Berkeley.EDU
> Subject: /graphics
>
> I'm having trouble mounting /graphics.
> It used to work (up until yesterday)
> The error I'm getting is:
> assault:/graphics server not responding: port mapper failure - rpc timed out

This should be working now. Let us know if it fails again.

[For the bugs list: the unfsd on assault was gone when I checked Tuesday
afternoon. I restarted it but continued to get the "port mapper failure"
message when I tried to mount /graphics on ginger. Rather than manually
restart the portmapper (and all the Sun RPC programs that use it), I
rebooted assault.]

mike

Log-Number: 31057
Date: Wed, 15 May 91 14:46:42 PDT
From: mottsmth (Jim Mott-Smith)
Subject: Starting X on Coons

I can't seem to start X on Coons. Typing 'xinit' generates

couldn't open /dev/mouse

and terminates. Does Coons have a special board or something that might
be causing this?

-- Jim M-S

Log-Number: 31060
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Wed, 15 May 1991 15:27:01 PDT
Subject: Re: Starting X on Coons

Coons has a graphics accelerator that we don't support.

John

Log-Number: 31061
Subject: old pmake on blackmail
Date: Wed, 15 May 91 16:56:48 PDT
From: Mike Kupfer <kupfer>

It looks like the pmake on blackmail is old. It doesn't seem to
understand expressions with "==" in them, and the strings in it are
different from the strings in the sun4 pmake. Not recognizing "==" means
that the new "settm.mk" breaks. Do we care?

mike

Log-Number: 31064
From: mgbaker (Mary Gray Baker)
Subject: cached attribute update problem
Date: Wed, 15 May 91 19:40:52 PDT

If you add or remove a file in a directory, it seems that the data
modify time on the directory and the size of the directory should
change. But they don't, at least not for a while. The problem is that
the update is made to the descriptor, but not to the cached attributes.
Can anyone think of a good reason why I can't just add the update to the
cached attributes too? This would fix a couple of problems I've been
having with Sprite.

Mary

Log-Number: 31065
Date: Thu, 16 May 91 01:45:59 PDT
From: Dean Long <dlong@@>
Subject: weird prefix behavior

Sprite lets you mount a prefix on top of a directory (not just special
files created by ln -r). For example, I can have a directory /sprite
with a sub-directory /sprite/cmds.sun4, and then mount a /sprite prefix
on top of the /sprite directory on the / partition. Now, I can access
either one. To access the prefix, I can use /sprite, and to access the
sprite directory of /, I can use /./sprite. Now the fun part comes when
you do "cd .." from /./sprite/cmds.sun4 -- infinite loop -- your shell
hangs, and you have to kill -KILL the whole thing.

dl

Log-Number: 31066
Date: Thu, 16 May 91 08:32:45 PDT
From: bmiller (Bob Miller)
Subject: followup to "./xgone" message

For what it's worth, the './xgone waiting for recovery' message only
seems to show up while I'm here. I scrolled back through the console
window and found that it never appeared from around 5 p.m. last night
until 7:45 a.m. this morning.

Log-Number: 31068
Date: Thu, 16 May 91 13:19:00 PDT
From: Dean Long <dlong@oak.ucsc.edu>
Subject: prefixes without a /

If I do something like this:

    prefix -M /dev/rsd01a -l root_back

and forget to put a / in front of root_back, it still gets mounted, but
I cannot access it, or unmount it (at least, I haven't figured out how
yet.)

dl

Log-Number: 31058
Date: Wed, 15 May 91 14:56:41 PDT
From: eklee (Edward K. Lee)
Subject: nfsmount problem

The last two problems with ginger and assault (once yesterday and once
today (5/15) at 2:50pm) were caused by performing the following actions
(I didn't realize until now that this was what was causing the problems
or I wouldn't have done it the second time.):

forgery% su eklee
forgery% cd /home/ginger/raid/viewlogic/raidII/xbus
forgery% vsm xbusreg

We don't need it fixed; I just thought I would report it.

Ed

Log-Number: 31059
Date: Wed, 15 May 91 15:25:15 PDT
From: eklee (Edward K.
Lee)
Subject: nfsmount problem

I believe that this problem is related to the earlier binary
compatibility problem I reported about 'check -s xbuspullup' not
working. In that case, it was because getwd(path) returned the incorrect
pathname for remote links (this is a known problem and one that is
apparently difficult to fix). I believe that getwd(path), when executed
from sprite on an nfsmounted partition, causes ginger to die. On the
bright side, here is one bug that causes ginger to die but leaves Sprite
relatively unaffected!

Ed

Log-Number: 31067
Date: Thu, 16 May 91 12:01:47 PDT
From: tve (Thorsten von Eicken)
Subject: update from sprite to nfs doesn't work with symbolic links

It creates a regular null file instead. This is running update on either
a ds3100 or a sun4.

TvE

Log-Number: 31069
Date: Thu, 16 May 91 15:00:25 PDT
From: tve (Thorsten von Eicken)
Subject: executing nfs-mounted files

What's the status of that? We really would like to have it.

Thanks,
TvE

Log-Number: 31070
Date: Thu, 16 May 91 16:19:21 PDT
From: root (The Sprite God)
Subject: nfsmount still gets big

assault-6# ps -vw 132 e1954
PID   CODSZ CODRS  HPSZ  HPRS STKSZ STKRS  SIZE   RSS COMMAND
e1954   124   100 21436  9028     8     8 21568  9136 nfsmount boing:/boing/tic /boing/tic
-----------------------------------------------------
Total   124   100 21436  9028     8     8 21568  9136

... just for the records.

Log-Number: 31071
Subject: ANSI compatibility (whining)
Date: Fri, 17 May 91 18:56:07 PDT
From: Mike Kupfer <kupfer>

It sure would be nice if we had an ANSI-compatible C library (e.g., one
in which "scanf" with "%i" recognizes a hex number correctly). Maybe for
spring cleaning we could steal part or all of the BSD C library?

mike

Log-Number: 31072
Date: Fri, 17 May 91 23:32:34 PDT
From: dlong (Dean Long)
Subject: prefix

prefix needs to be relinked for the same reason as rdate -- it does not
read /etc/spritehosts correctly if you have the right combination of
blank lines and comments.
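
[Illustration for the bug log: a minimal sketch of a host-file reader that
tolerates blank lines and comments. It is not the Sprite Host_Next code.
It assumes comments start with '#', which the report does not state, and
is_skippable_line and next_entry are hypothetical names.]

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if the line is empty, whitespace only, or a comment.
 * ASSUMPTION: a comment is a line whose first non-blank character is '#'. */
int is_skippable_line(const char *line)
{
    while (*line != '\0' && isspace((unsigned char)*line)) {
        line++;
    }
    return *line == '\0' || *line == '#';
}

/* Hypothetical stand-in for a Host_Next-style reader: copies the next
 * real entry into buf (without its newline) and returns 1, or returns 0
 * at end of file. Lines longer than the buffer are not handled specially. */
int next_entry(FILE *fp, char *buf, size_t size)
{
    assert(fp != NULL && buf != NULL && size > 1);
    while (fgets(buf, (int)size, fp) != NULL) {
        if (!is_skippable_line(buf)) {
            buf[strcspn(buf, "\n")] = '\0';
            return 1;
        }
    }
    return 0;
}
```

A caller would loop on next_entry until it returns 0. Because the skipping
lives in one routine, every command linked against it handles any mix of
blank lines and comments the same way.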

dl

Log-Number: 31075
Date: Sat, 18 May 91 00:33:19 PDT
From: Dean Long <dlong@oak.ucsc.edu>
Subject: df too

Just like rdate and prefix, df needs to be relinked. Any command that
accesses /etc/spritehosts needs to be relinked (if it hasn't been
relinked more recently than 8/23/89, when Host_Next was changed to allow
blank lines).

dl

Log-Number: 31074
Date: Fri, 17 May 91 23:40:10 PDT
From: dlong (Dean Long)
Subject: more on mounting prefixes on directories

If you mount a prefix on a directory, and then export the prefix, the
machine that imports it gets the directory that is "underneath" the
prefix.

dl

Log-Number: 31077
Date: Mon, 20 May 91 16:18:23 -0700
From: sullivan@postgres.Berkeley.EDU (Mark Sullivan)
Subject: file system bug

I was remotely logged into babylon (the only sprite machine belonging to
the postgres group is in another office). I was trying to run pmake but
it kept failing with strange errors of the form:

Object file format error in: regproc.o: bad file magic number

Turns out that my disk was 100% full and I couldn't see the "writeback
error: disk full" messages on the console because I was remotely logged
in. I was doing a big pmake so it took a long time before I noticed that
the pmake wasn't working.

Now Babylon is in pretty bad shape. The cache seems to be sufficiently
full that I can't page in commands (ls, rm, df, and mail all hung).
There are messages on the console about trying to recover the command
executable files. I'm going to try to reboot babylon to clear out the
cache.

Isn't there something that can be done to make write() system calls fail
once writebacks to the disk start to fail?

Mark

Log-Number: 31078
Subject: kgdb broken for sun3
Date: Mon, 20 May 91 17:16:48 PDT
From: Mike Kupfer <kupfer>

The new kgdb (version 3.5) is unable to set a breakpoint on a sun3. It
complains that the address it's setting the breakpoint at is illegal.
I've renamed files in /sprite/cmds.sun? so that you get the 3.2
kgdb.sun3 (which works).

mike

Log-Number: 31079
Subject: "type" and "flag" fields
Date: Mon, 20 May 91 17:53:34 PDT
From: Mike Kupfer <kupfer>

A common practice in Sprite is to declare a type or flags field as an
integer, with the various possible values "defined below". However, it's
not always easy to find the #defines, especially if (1) they or the
struct get moved or (2) there are multiple sets of values "defined
below". Consider, for example, Fs_FileID, which is defined in user/fs.h.
Until 5 minutes ago the comments for "type" said "Defined below", even
though the definitions are really in fsio.h. (And comments in fsio.h
claimed that the types were defined in fs.h.)

I think we should change the Sprite coding conventions to make this
problem less likely to occur. I can think of two ways to make the
change. The first way is to use higher-level features of C, i.e., enums
(for types) and bitfields (for flags). The second way is to use a bit of
syntactic sugar and more programmer discipline. So, to take the
Fs_FileID example, instead of declaring "type" as

    int type;               /* Defined below. Used in I/O switch, and
                             * implicitly indicates what kind of structure
                             * follows the FsHandleHeader in the Handle. */

we could declare it as

    Fsio_StreamType type;   /* Used in I/O switch, and implicitly
                             * indicates what kind of structure follows
                             * the FsHandleHeader in the Handle. */

where Fsio_StreamType is simply a typedef for int. The #defines for
stream types would follow immediately after the Fsio_StreamType typedef.
Thus there are no potentially bogus "defined below" comments, and by
putting the defines next to the typedef, the Right Thing is more likely
to happen if something gets moved.

mike

Log-Number: 31080
Subject: Re: problem with mail
Date: Tue, 21 May 91 12:00:59 PDT
From: Mike Kupfer <kupfer>

> Date: Tue, 21 May 91 11:53:56 PDT
> From: bertrand (Bertrand Irissou)
> Subject: problem with mail
>
> I am getting the following warning when I startup mail:
> mail
> Warning: encountered nulls at 110317.
> Mail spool file may be damaged.
> Mail version 5.4 6/29/88. Type ? for help.
>
> What is that supposed to mean?
>
> Bertrand

We've been having problems with files getting munged occasionally. It
sounds like that happened to your mail. You probably want to check for
mail messages that got truncated or otherwise damaged, and it might not
hurt to send mail to your regular email correspondents, asking if
they've sent you anything important recently.

Sorry about all the bother this is causing you.

mike

Log-Number: 31081
Subject: /pcs/vlsi is having problems
Date: Wed, 22 May 91 16:17:21 PDT
From: Mike Kupfer <kupfer>

I was in the machine room to check on raid1 and I noticed a bunch of
error messages on assault's console.

Warning: SCSI Disk SII#0 Target 4 LUN 0 error: media error - info bytes 0x0 0x0 0x3c 0x58
File blk 1931 phys blk 7724:
5/22/91 12:55:13 broadcast (0) File "(NULL)" <4,0> Write-back failed: DISK ERROR
OfsBlockRealloc: Bad descriptor block. Domain=4 block=7724.

mike

Log-Number: 31086
From: mendel (Mendel Rosenblum)
Subject: Allspice ipServer dies / sendmail problem
Date: Thu, 23 May 91 11:18:14 PDT

Sendmail on allspice was comatose this morning. I suspect it went
comatose about the time the following messages appeared in allspice's
syslog:

<18>May 23 07:51:55 sendmail[40e38]: NOQUEUE: SYSERR: getrequests: accept: invalid argument

I did a kill -KILL on the sendmail process and the ipServer on allspice
died.

Mendel

Log-Number: 31089
Subject: dvi2ps versus dvi2ps.new
Date: Thu, 23 May 91 14:38:07 PDT
From: Mike Kupfer <kupfer>

There appear to be two versions of dvi2ps in /sprite/src/adobecmds. The
dvi2ps.new version doesn't appear to actually be used, however--there
aren't any .md directories. Also, the sources in dvi2ps.new appear to be
older than the ones in dvi2ps. Can anyone think of a reason for keeping
dvi2ps.new?

mike

Log-Number: 31090
From: tve (Thorsten von Eicken)
Subject: Re: dvi2ps versus dvi2ps.new (and dvips vs.
dvips.new, and TeX 3.0)
Date: Thu, 23 May 91 15:01:27 PDT

I don't know about dvi2ps, except that it should be phased out in favor
of dvips. The same ".new" situation exists for dvips: I grabbed newer
sources which handle postscript fonts (virtual fonts). However I never
made the .new the default because it seemed to have trouble printing on
the old laserwriters.

I also have TeX 3.0 in /pcs/tex mostly installed. I never found the disk
space in /sprite/src/cmds to install it. I'm going to rm -rf it soon as
I'm using TeX from the sww now. If you want TeX 3.0 on sprite grab it
now.

TvE

Log-Number: 31091
Subject: DONT_MIGRATE set on foreign process
Date: Thu, 23 May 91 18:59:53 PDT
From: Mike Kupfer <kupfer>

Mark Sullivan reported that his makes on greed were hanging. The problem
seemed to be a job that had migrated to clove. (Both greed and clove are
DECstations running 1.091.) Here's the ps output from clove:

clove% ps -amM
PID   STATE FLAGS EVENT    RNODE RPID  COMMAND
d3916 NEW   9902  ffffffff greed 9185a cc -g -I. -I../include ...
f391a RWAIT 2102  ffffffff greed e1847 sh -ev
f391e SUSP  182   ffffffff greed 184c  cc -O -I. -I../include ...

First, does anyone know why these jobs get suspended? Is this some sort
of load balancing by pmake? This is not the first time I've noticed
compilations that were suspended for no apparent reason.

Second, the suspended compilation has DONT_MIGRATE set. As you may
recall, I reported a couple weeks ago a problem with a migrated
simulation job on arson that had DONT_MIGRATE set. Perhaps there should
be a paranoia check in DeencapProcState to verify that DONT_MIGRATE
isn't set?

I continued the suspended cc, and it and the sh went away, leaving the
other cc, which was apparently stuck in the middle of migration
(REMOTE_EXEC_PENDING, MIGRATING, NO_VM). I tried killing this last cc,
but that succeeded only in hanging my shell.

clove-3# ps -amM
PID   STATE FLAGS EVENT    RNODE RPID  COMMAND
d3916 NEW   9902  ffffffff greed 9185a cc -g -I. -I../include ...

mike

Log-Number: 31093
From: mgbaker (Mary Gray Baker)
Subject: pmake installMACHINE_NAME problem
Date: Fri, 24 May 91 18:37:43 PDT

For user commands on the sun3 or sun4, you can install things by saying

    pmake installsun4 or pmake installsun3

instead of

    pmake install TM=sun4 or TM=sun3

But on the ds5000, if I say pmake installds5000 or pmake installds3100,
I get the error message

--- .BEGIN ---
Sorry, the target machine (ds5000) isn't in the list of
allowed machines (ds3100 sun3 sun4).
exit 1
*** Error code 1
pmake: 1 error

Mary

Log-Number: 31094
Subject: Re: pmake installMACHINE_NAME problem
Date: Fri, 24 May 91 18:40:37 PDT
From: Mike Kupfer <kupfer>

You probably have an old Makefile that doesn't handle the ds5000->ds3100
mapping correctly. Do a mkmf and try again.

mike

Log-Number: 31096
Date: Sat, 25 May 91 19:58:24 PDT
From: tve@ginger.Berkeley.EDU (Thorsten von Eicken)
Subject: anise seems wedged, it doesn't serve /pcs

Log-Number: 31097
From: mendel (Mendel Rosenblum)
Subject: Re: anise seems wedged, it doesn't serve /pcs
Date: Sun, 26 May 91 14:02:19 PDT

I rebooted anise. It was running the "new" kernel. Please only boot
kernels with the uninstalled lfs module. Follow the directions on
anise's console if you don't have a kernel of your own that you wish to
boot.

Thorsten, /pcs is currently 92% full. Things will go more smoothly if
you don't try to use so much of the disk.

Mendel

Log-Number: 31098
Date: Sun, 26 May 91 17:27:47 PDT
From: shirriff (Ken Shirriff)
Subject: Main_InitVars fails on ds3100

I found that kernel variables set in Main_InitVars are soon cleared by
Mach_Init, which zeros the bss. I added a second call to Main_InitVars,
so the values will be restored.

Ken

Log-Number: 31099
Date: Mon, 27 May 91 13:08:54 PDT
From: sullivan (Mark Sullivan)
Subject: more make trouble

I'm getting "writeback failed" messages from babylon.
Message is as follows:

Client command, writeback & invalidate msg to Client 24 file "tmp.makefile" <8,51762> failed 40012

Then many:

Client state killed: 0 refs 0 write 0 exec

followed by a message like the one above but for a different file.

This could be a problem with my kernel (the server for /postdev is
running my write-protect vm), but I'm using the installed version of the
file system.

Mark

P.S. According to df, babylon has about 170MB of free space so it
shouldn't be failing because the disk is full.

Log-Number: 31100
From: mendel (Mendel Rosenblum)
Subject: Bug fix: deadlock on devDiskStatMutex
Date: Tue, 28 May 91 13:18:42 PDT

Larceny just deadlocked on the devDiskStatMutex. The problem is that the
routine Dev_GetDiskStats() grabs the deviceListMutex to synchronize
access to the list of disk devices. For each device it copies the stats
into the buffer passed to the routine. This buffer points into the
user's address space and has had a Vm_MakeAccessible done on it in the
Sys_StatsStub() routine. The Vm_MakeAccessible() ensures that the buffer
is resident in the virtual address space but does not ensure that the
pages are resident in physical memory. If the pages aren't resident the
routine Dev_GetDiskStats() gets a page fault with the deviceListMutex
held. Since the page fault causes the interrupts to be re-enabled, the
callback Dev_GatherDiskStats() can be called. When Dev_GatherDiskStats()
tries to grab the deviceListMutex a deadlock occurs.

Since all other Sys_Stats type routines use Vm_CopyOut rather than
Vm_MakeAccessible(), I will fix this deadlock by changing the disk stats
case to use Vm_CopyOut.

Mendel

Log-Number: 31107
From: mendel (Mendel Rosenblum)
Subject: allspice crash with level 15 interrupt error
Date: Wed, 29 May 91 13:08:10 PDT

Allspice panic'ed this morning with a level 15 interrupt error. The
problem was that the hardware detected a dirty cache block at a virtual
address that did not have a valid PTE.
The error happened on context number 2 at address 0x49560. A rlogind
created from a rlogin of jhh from loiter was the process loaded at
context 2 at the time. I suspect that it is either a bug in the ds5000
port or a problem with fscheck.

The address in rlogind had an invalid pmeg loaded and the rlogind was in
the middle of reading a page fault from disk into the stack segment. The
page fault was caused by Fs_Select trying to Vm_CopyOut. I suspect the
rlogind was awakened by a wall message.

I was able to take a core dump and continue allspice without any
problems.

Mendel

Log-Number: 31108
Subject: adduser, deleteuser, and RCS
Date: Wed, 29 May 91 14:21:32 PDT
From: Mike Kupfer <kupfer>

The adduser program doesn't recover correctly if the check-out fails. It
proceeds as though the check-out had succeeded.

The deleteuser program doesn't even know about RCS.

mike

Log-Number: 31109
Subject: strangeness when building pmake
Date: Wed, 29 May 91 15:38:12 PDT
From: Mike Kupfer <kupfer>

Suppose I edit pmake/src/parse.c. If I say "pmake", the
src/foo.md/linked.o gets rebuilt correctly, but then pmake stops. If I
say "pmake" again, the final sun4.md/pmake gets built. Why doesn't it
get built with the first invocation of pmake?

mike

Log-Number: 31111
Subject: fclose(NULL) goes into debugger
Date: Wed, 29 May 91 17:07:59 PDT
From: Mike Kupfer <kupfer>

If you call fclose with a NULL stream handle, you end up in the
debugger. (There's also a bug report from Fred about this from last
August.) It would be good if we either (1) fix the documentation and
comments to reflect this or (2) fix fclose to not drop core. The BSD
guys are deliberately letting fclose die, so that's one argument for not
changing the code. On the other hand, my reading of K&R implies that
fclose should return EOF.

mike

Log-Number: 31113
Date: Wed, 29 May 91 20:01:19 PDT
From: dlong (Dean Long)
Subject: SIGIO, FASYNC, sockets

Asynchronous mode does not seem to be implemented for sockets.
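
[Illustration for the bug log: a minimal sketch of the conventional BSD way
to request SIGIO on a socket, i.e. the sequence a program would expect to
work. Whether Sprite's compatibility library honors it is exactly what the
report questions. The helper names on_sigio, enable_sigio and
async_socketpair are hypothetical, and the O_ASYNC spelling is assumed to
be available on the build system (FASYNC is used as a fallback).]

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* expose O_ASYNC/FASYNC under strict glibc modes */
#endif
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef O_ASYNC
#define O_ASYNC FASYNC /* older BSD spelling of the same flag */
#endif

/* Set by the handler; a real program would then poll or read its sockets. */
volatile sig_atomic_t sigio_seen = 0;

void on_sigio(int signo)
{
    (void)signo;
    sigio_seen = 1;
}

/* Ask the kernel to send SIGIO to this process when fd becomes ready.
 * Returns 0 on success, -1 on error. */
int enable_sigio(int fd)
{
    struct sigaction sa;
    int flags;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigio;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGIO, &sa, NULL) < 0) {
        return -1;
    }
    if (fcntl(fd, F_SETOWN, getpid()) < 0) {   /* who receives SIGIO */
        return -1;
    }
    flags = fcntl(fd, F_GETFL);
    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFL, flags | O_ASYNC) < 0 ? -1 : 0;
}

/* Demo helper: a connected local socket pair with SIGIO enabled on sv[0]. */
int async_socketpair(int sv[2])
{
    assert(sv != NULL);
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return -1;
    }
    return enable_sigio(sv[0]);
}
```

After enable_sigio succeeds, data arriving on the descriptor should raise
SIGIO in the owning process. A quick check is to read the flags back with
fcntl(fd, F_GETFL) and confirm the async bit is still set.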
dl Log-Number: 31114 Date: Thu, 30 May 91 09:13:34 PDT From: ouster (John Ousterhout) Subject: New gdb.sun4 loses stack information I'm having troubles debugging with the new gdb (version 3.5) for the Sun4. When a segmentation fault occurs, the debugger can only print out the lowest stack frame (it calls this frame #0 and insists there's nothing higher on the stack). Gdb.old works just fine. -John- Log-Number: 31117 From: mendel (Mendel Rosenblum) Subject: Re: New gdb.sun4 loses stack information Date: Thu, 30 May 91 18:26:00 PDT I installed a new gdb for the sun4 that doesn't lose stack frames on John's "bad" example. The problem was that the program was built from library routines in files with names ending with ".go" rather than ".o". This caused gdb to think that all routines were part of the "startup" frame which it doesn't display. Mendel Log-Number: 31118 Date: Thu, 30 May 91 20:24:11 PDT From: Dean Long <dlong@oak.ucsc.edu> Subject: ipServer bug There is a bug in ipServer for UDP packets. If I do a recvfrom(fd, buf,len,...) and a packet arrives that is less than len, recvfrom blocks, instead of returning the packet. It turns out that the optimized UDP_ReadRequest is not working correctly. If you make it call UDP_SocketRead instead, it works fine. dl Log-Number: 31120 Subject: dev/ds3100.md/devTypesInt.h Date: Thu, 30 May 91 21:28:06 PDT From: Mike Kupfer <kupfer> Is this (kernel) file history? The file itself doesn't exist, but there is an RCS file for it. (Do we have a convention for deleting or renaming RCS files for deleted source files?) mike Log-Number: 31121 Date: Fri, 31 May 91 10:47:14 PDT From: tve (Thorsten von Eicken) Subject: move nfs mounts away from assault? Would that make sense, given that assault is being retired? 
TvE Log-Number: 31122 From: mendel (Mendel Rosenblum) Subject: Profiling broken on sun4 - suspect UNIX compatibility mode Date: Fri, 31 May 91 10:55:19 PDT Sun4 object files linked with the -p or -pg flags seg fault upon startup in monstartup(). It appears that the data segment gets invalidated. The following messages appear in the syslog: Executing UNIX file in compatibility mode. Moving stack pointer for Unix binary. MachPageFault: Bus error in user proc d1250, PC = 397c, addr = 40ec0 BR Reg 8080 Mendel Log-Number: 31123 Subject: "man -i" is too picky about NAME Date: Fri, 31 May 91 17:10:11 PDT From: Mike Kupfer <kupfer> I ran the man page indexer by hand and noticed a bunch of complaints like ``Couldn't find "NAME" section in "cb.man".'' Some of these are because the man page is already formatted text, rather than troff input. Okay, fine. Some of the complaints seem to result from having a blank at the end of ``.SH NAME ''. This seems unnecessarily picky to me. It also doesn't like ``.SH "NAME"'', which appears in a CMU man page that I imported. mike Log-Number: 31124 Subject: /swap filesystem error messages at reboot Date: Sun, 02 Jun 91 12:20:44 PDT From: Mike Kupfer <kupfer> When I rebooted allspice, I noticed a bunch of error messages of the form file swap/14/243 references non-allocated descriptor 114360. File Deleted. Entry 243 now has nameLength 3 recordLength <some integer>, fileNumber 0. Is this normal? mike Log-Number: 31127 Subject: deleteuser doesn't always edit aliases file correctly Date: Mon, 03 Jun 91 14:04:20 PDT From: Mike Kupfer <kupfer> The deleteuser command doesn't always leave the aliases file in good shape after removing a user. The basic problem is that there are many different ways a name can be on a list (e.g., last name in the list, last name on a line, only name on a line, etc.), and deleteuser doesn't get them all right. 
(Rather than working directly on the text representation, it should probably read each alias into a linked list or something, remove the user name wherever it appears, then write the list back out. How hard would it be to do this in Tcl?) I'm not real enthusiastic about trying to fix this right now, unless it would be well-suited for Tcl, in which case I've been looking for a Tcl learning exercise anyway. The other options are: (1) leave in the current set of bugs, which crop up if the user name is the only one on a line or is the last one in the list. (2) bring up an editor, so that the administrator can edit the aliases file manually, rather than having deleteuser do it. What say you? mike Log-Number: 31130 Date: Tue, 4 Jun 91 11:03:04 PDT From: shirriff (Ken Shirriff) Subject: Pmake can't install sun4 script on ds3100 If I'm on a ds3100, I can't do pmake TM=sun4 install to install a shell script, because it says "you cannot compile for a sun4 on this machine". It should only give this message if the install involves compilation. Log-Number: 31133 Subject: SCSI bus error made LFS panic (allspice crash) Date: Tue, 04 Jun 91 13:20:02 PDT From: Mike Kupfer <kupfer> Allspice got a SCSI bus error and LFS panicked at around 1245 today. Mendel took a core dump and we rebooted. mike Log-Number: 31134 Subject: hidden machine dependencies Date: Tue, 04 Jun 91 13:32:16 PDT From: Mike Kupfer <kupfer> Do we have a list of things in Sprite that have machine dependencies? I'm thinking about things like: - directories that have $MACHINE links - scripts (e.g., bootcmds) that have special cases for different machine types - system makefiles that have special cases for machine types (I just found yet another place--the pmake library--that has ds3100 special-case stuff and needs to know about ds5000's.) 
mike Log-Number: 31135 Date: Tue, 4 Jun 91 14:46:11 PDT From: kupfer@ginger.Berkeley.EDU (Mike Kupfer) Subject: allspice crash: Fscache_RemoveFileFromDirtyList allspice panicked with "Fscache_RemoveFileFromDirtyList blocks in cache". I took a core and will poke around. mike Log-Number: 31136 Date: Tue, 4 Jun 91 13:50:20 PDT From: kupfer@ginger.Berkeley.EDU (Mike Kupfer) Subject: Another SCSI bus error brought down allspice Allspice just crashed again. There was a SCSI bus error, then LFS panicked trying to access /swap1. mike Log-Number: 31138 Date: Wed, 5 Jun 91 00:21:51 PDT From: elm (ethan miller) Subject: mail to cory hall sprite Is there any good way to have mail sent to Cory Hall sprite forwarded back to allspice? I tried putting an entry into the aliases file on king, but that didn't seem to work. Is there any (easy) way to kludge the sendmail.cf file to do this? thanks ethan Log-Number: 31141 Subject: allspice crashes unlocking unlocked handle Date: Wed, 05 Jun 91 17:03:42 PDT From: Mike Kupfer <kupfer> The allspice problems this afternoon were apparently caused by somebody's current working directory being deleted, leading to the now-famous "unlocking an unlocked handle" panic. These panics are continuable, once you knock the relevant client into the debugger. mike Log-Number: 31142 Subject: structure of setjmp.h Date: Wed, 05 Jun 91 21:35:37 PDT From: Mike Kupfer <kupfer> /sprite/lib/include/setjmp.h is currently a symbolic link to $MACHINE.md/setjmp.h. It might be better if it were broken into a machine-independent part (function prototypes) and machine-dependent part (size of jmp buf, etc.). 
mike Log-Number: 31144 From: mendel (Mendel Rosenblum) Subject: Re: Murder's tape messed up Date: Thu, 06 Jun 91 10:52:03 PDT > Return-Path: shirriff > Received: by sprite.Berkeley.EDU (5.59/1.29) > id AA600596; Thu, 6 Jun 91 10:46:09 PDT > Date: Thu, 6 Jun 91 10:46:09 PDT > From: shirriff (Ken Shirriff) > Message-Id: <9106061746.AA600596@sprite.Berkeley.EDU> > To: bugs > Subject: Murder's tape messed up > > The tape drive on murder gives me "wait on PHASE_COMMAND failed" and other > SCSI errors. I tried power cycling murder and the tape but that didn't help. Try using a different SCSI cable. Mendel Log-Number: 31146 From: mendel (Mendel Rosenblum) Subject: bug fix: directory deleting unlocking unlock handle problem Date: Fri, 07 Jun 91 13:21:43 PDT > > The allspice problems this afternoon were apparently caused by > somebody's current working directory being deleted, leading to the > now-famous "unlocking an unlocked handle" panic. > > These panics are continuable, once you knock the relevant client into > the debugger. > > mike The uninstalled fslcl module contains a bug fix for the above problem. This fix avoids the panic when deleting a directory tree that another process has "cd"'ed into. There were two problems: 1) The <name, directory> to file handle hash table entry for a directory's ".." was being deleted when the directory was deleted. This entry should really be deleted when the directory's name is unlinked from its parent. This only caused problems when the directory was unlinked but not deleted because it was still open. I changed the code to delete the hash table entry when the name is unlinked from its parent. 2) The FindComponent routine sometimes unlocked a handle it was passed and sometimes it didn't. I changed the code so the lock status of the handle can be determined from the return status (SUCCESS or !SUCCESS). This allows the caller to avoid unlocking an already unlocked handle. 
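[A minimal sketch of the calling convention described in fix 2, not the actual fslcl code. The message does not say which return status leaves the handle locked; the sketch assumes SUCCESS means "still locked" and anything else means "already unlocked". Handle, HandleLock, HandleUnlock, FindComponentSketch, and LookupSketch are hypothetical names invented for illustration.]

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins; the real Sprite types and status codes differ. */
typedef enum { SUCCESS = 0, FAILURE = 1 } ReturnStatus;

typedef struct {
    bool locked;
} Handle;

static void HandleLock(Handle *h)   { assert(!h->locked); h->locked = true; }
static void HandleUnlock(Handle *h) { assert(h->locked);  h->locked = false; }

/*
 * Assumed contract: dirPtr is locked on entry.
 *   SUCCESS   -> dirPtr is still locked on return.
 *   otherwise -> dirPtr has already been unlocked.
 * The lookup itself is faked: only the name "present" is found.
 */
static ReturnStatus FindComponentSketch(Handle *dirPtr, const char *name)
{
    assert(dirPtr->locked);
    if (strcmp(name, "present") == 0) {
        return SUCCESS;
    }
    HandleUnlock(dirPtr);
    return FAILURE;
}

/*
 * The caller decides whether to unlock purely from the return status,
 * so it can never unlock a handle that is already unlocked.
 */
static ReturnStatus LookupSketch(Handle *dirPtr, const char *name)
{
    ReturnStatus status;

    HandleLock(dirPtr);
    status = FindComponentSketch(dirPtr, name);
    if (status == SUCCESS) {
        HandleUnlock(dirPtr);
    }
    return status;
}
```

[With a convention like this, the asserts in HandleUnlock play the role of the "unlocking an unlocked handle" panic: they can only fire if a caller ignores the return status.]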
Mendel Log-Number: 31147 Date: Fri, 7 Jun 91 15:01:11 PDT From: bmiller (Bob Miller) Subject: adduser I'm trying to add a new user, but running into a little difficulty. The message I get is: Permission denied. Could't fetch entry from thalm Make sure your machine is listed in /.rhosts I've checked and thalm is operating. Any ideas as to what the problem might be? Bob Log-Number: 31148 Date: Sat, 8 Jun 91 14:13:47 PDT From: ouster (John Ousterhout) Subject: Lust reboot Lust was refusing to talk to Tyranny this afternoon. When I went upstairs to take a look it seemed to respond to the console but was recovering with Allspice every few seconds. In retrospect I should have called the DDJ, but I assumed no-one else was around so I just rebooted it. -John- Log-Number: 31149 Date: Sun, 9 Jun 91 18:05:51 PDT From: dlong@dogwood.ucsc.edu (Dean Long) Subject: Possible problem with RPC I have the following in /boot/bootcmds: rpccmd -negAcksOn -channelNegAcksOn -numNackBufs 2 and got a few panics saying: Rpc_ChanAlloc can't find the free channel. from rpc/rpcCall.c Log-Number: 31150 Date: Mon, 10 Jun 91 10:05:18 PDT From: schauser (Klaus Erik Schauser) Subject: rmail on Sun3 I used rmail under emacs on paprika (a Sun3). It read the mail from /usr/spool/mail/schauser and then went into the debugger. Ken told me that I could recover the mail from ~/.newmail. Adam Dingle (also 444E) told me that he had lost mail several times on hoot (also a Sun3) because rmail crashed. Klaus ***************************************************** Here the session: You have new mail. paprika:/pcs/schauser> emacs & [1] 10b11 paprika:/pcs/schauser> [1] + Segmentation violation emacs paprika:/pcs/schauser> ps PID STATE TIME COMMAND 10b11 DEBUG 0:06 emacs 30b4a WAIT 0:04 -csh 10b13 EXIT 0:00 /emacs/cmds/movemail /usr/spool/mail/schauser ... 
10b14 RUN 0:00 ps paprika:/pcs/schauser> more /usr/spool/mail/schauser paprika:/pcs/schauser> ls -alr /usr/spool/mail/schauser -rw------- 1 schauser 0 Jun 10 09:44 /usr/spool/mail/schauser paprika:/pcs/schauser> Log-Number: 31153 Date: Mon, 10 Jun 91 14:12:25 PDT From: tve (Thorsten von Eicken) Subject: problems with troff_p [crackle doc] ditroff -ms weiss /sprite/cmds.sun4/troff_p: Can't open /sprite/lib/ditroff/devpsc/i.out; line 20, file weiss [In ~tve/doc, this seems to be new?] Log-Number: 31157 Date: Mon, 10 Jun 91 17:55:35 PDT From: shirriff (Ken Shirriff) Subject: Re: problems with troff_p >/sprite/cmds.sun4/troff_p: Can't open /sprite/lib/ditroff/devpsc/i.out; line 20 The problem is in line 20 you have \fiAlixandre\fR instead of \fIAlixandre\fR. Since there is no font "i", troff complains. (Old troff does the same thing, except it doesn't tell you the line number.) Ken Log-Number: 31154 From: mendel (Mendel Rosenblum) Subject: Allspice hangup - timer interrupts quit Date: Mon, 10 Jun 91 14:20:14 PDT Allspice hungup just now. It appears to be the problem where the timer callback queue is no longer being processed. I l1-A'ed and continued it and the hangup cleared up. Mendel Log-Number: 31155 Date: Mon, 10 Jun 91 14:41:00 PDT From: ouster (John Ousterhout) Subject: Pmakes hung Pmakes don't seem to be working without the "-X" switch: they seem to hang up trying to talk to some machine like terrorism. Anybody (e.g. DDJ?) have any ideas how to unwedge them? -John- Log-Number: 31156 From: mendel (Mendel Rosenblum) Subject: Deadlock with migration and recovery Date: Mon, 10 Jun 91 15:22:50 PDT Terrorism was hanging migrations to it because of the following deadlock. Process 0x20 - A pmake from tyranny was being migrated off terrorism. The call stack looked like: (gdb) where #0 0xf600c6d0 in Mach_ContextSwitch () #1 0xf60ad180 in SyncEventWaitInt (...) (...) #2 0xf60abe64 in Sync_SlowWait (...) (...) #3 0xf60bdfb8 in VmPageFreeInt (...) (...) 
#4 0xf60be018 in VmPageFree (...) (...) #5 0xf60bca10 in FreePages (...) (...) #6 0xf60bbd0c in Vm_EncapState (...) (...) #7 0xf608bc6c in Proc_MigrateTrap (...) (...) #8 0xf60ab390 in Sig_Handle (procPtr=(struct Proc_ControlBlock *) 0xf63bb4d0, sigStackPtr=(Sig_Stack *) 0xf63be540, pcPtr=(char **) 0xf805fe3c) (signals.c line 1223) #9 0xf600ea7c in MachUserAction (...) (...) #10 0xf6010978 in MachReturnFromTrap () The routine Proc_MigrateTrap() locks the process table entry for the current process before calling Vm_EncapState(). The routine VmPageFreeInt() is waiting for recovery on allspice so it can write out a dirty page of the data segment. This wakeup never happens because the Proc_ServerProc doing the recovery with allspice has a stack that looks like: #0 0xf600c6d0 in Mach_ContextSwitch () #1 0xf60ad180 in SyncEventWaitInt (event=4131108156, wakeIfSignal=0) (syncLock.c line 655) #2 0xf60abe64 in Sync_SlowWait (conditionPtr=(struct Sync_Condition *) 0xf63bb53c, lockPtr=(struct Sync_KernelLock *) 0xf6160bd8, wakeIfSignal=0) (syncLock.c line 298) #3 0xf6095bc0 in Proc_Lock (procPtr=(struct Proc_ControlBlock *) 0xf63bb4d0) (procTable.c line 416) #4 0xf608f8fc in Proc_WakeupAllProcesses () (procMisc.c line 988) #5 0xf605c53c in Fsutil_Reopen (...) (...) #6 0xf609aa30 in RecovRebootCallBacks (data=(ClientData) 0xe) (recovery.c line 1153) #7 0xf6094e8c in Proc_ServerProc (...) (...) #8 0xf60a83e8 in Sched_StartKernProc (...) (...) While trying to do a Proc_WakeupAllProcesses(), it hit the locked process table entry from the migrate and hung. This left terrorism hung in recovery with allspice. I rebooted terrorism. Mendel Log-Number: 31158 Date: Tue, 11 Jun 91 16:49:49 -0700 From: margo@postgres.berkeley.edu (Margo Seltzer) Subject: mkmf hanging during makedepend This doesn't seem to happen deterministically, but approximately 1 of 5 times, mkmf hangs during the makedepend phase. Killing the process and reissuing the mkmf seems to work. 
Log-Number: 31159 Date: Tue, 11 Jun 91 20:26:30 PDT From: sullivan (Mark Sullivan) Subject: xgraph bug and fix If one axis uses a log scale, the labels on the tick marks on that axis are printed out using the wrong printf format. The version of xgraph on our (Ultrix) Decstations, prints the labels out in scientific notation format and the version on Sprite prints it in normal floating point format. The result is that logarithmic scale axes on Sprite have labels with lots and lots of zeros in them. The bug is in xgraph.c in the xgraph source directory. In a routine called "WriteValues", there is a test for the special case of log axes. The routine also tests to make sure that the user hasn't defined his or her own print format: if (logFlag) { if (fmt==DEF_FORMAT) { /* print scientific notation */ } else { /* print using user-defined format */ } } For whatever reason, there are several copies of this default format string around, so the pointer equality comparison fails. The conditional should be: if (! strcmp(fmt,DEF_FORMAT)) Mark Log-Number: 31161 Date: Wed, 12 Jun 91 23:50:09 PDT From: shirriff@ginger.Berkeley.EDU (Ken Shirriff) Subject: Allspice crashed -scsi bus error Allspice encountered a scsi bus error and died in Lfs, so I rebooted. Ken Log-Number: 31139 Subject: damaged mail queue: fscheck bug? 
Date: Wed, 05 Jun 91 15:25:16 PDT From: Mike Kupfer <kupfer> Some excerpts from allspice's syslog: <18>Jun 4 15:11:54 sendmail[40e3c]: AA462430: SYSERR: qfAA462430: line 1: readqf(AA462430:1): bad line "ble (56) completed client recovery": file already exists <18>Jun 4 15:11:54 sendmail[40e3c]: AA462430: SYSERR: qfAA462430: line 2: readqf(AA462430:2): bad line "FsrmtFileVerify, no handle <1,-79978> client 56" <18>Jun 4 15:11:54 sendmail[40e3c]: AA462430: SYSERR: qfAA462430: line 3: readqf(AA462430:3): bad line "6/4/91 14:18:57 burble (56) initiating recovery" <22>Jun 4 15:11:55 sendmail[40e3c]: AA462430: to=unit, delay=00:00:01, stat=User unknown <22>Jun 4 15:12:00 sendmail[40e3c]: AA462430: to=recv, delay=00:00:06, stat=User unknown <22>Jun 4 15:12:11 sendmail[40e3c]: AA462430: to=einit, delay=00:00:17, stat=User unknown <18>Jun 4 15:12:11 sendmail[40e3c]: AA462430: SYSERR: qfAA462430: line 5: readqf(AA462430:5): bad line "Intel: Spurious interrupt (2)" <18>Jun 4 15:12:11 sendmail[40e3c]: AA462430: SYSERR: qfAA462430: line 6: readqf(AA462430:6): bad line "6/4/91 14:19:01 burble (56) completed client recovery" <18>Jun 4 15:12:11 sendmail[40e3c]: AA462430: SYSERR: qfAA462430: line 7: readqf(AA462430:7): bad line "FsrmtFileVerify, no handle <1,-79978> client 56" <18>Jun 4 15:12:11 sendmail[40e3c]: AA462430: SYSERR: qfAA462430: line 8: readqf(AA462430:8): bad line "6/4/91 14:19:01 burble (56) initiating recovery" Recall that allspice crashed at around 1245, 1345, and 1440 on the 4th. How conservative is fscheck? Are there files that it is claiming to fix, when in fact it should stick them in lost+found so that a user can look at them? While sendmail was bitching, allspice was also going through a recovery loop with burble; I don't know if the two are related. mike [if fscheck finds that a block belongs to two files, it leaves it (or a copy?) in each file. Eventually fscheck will go away, so we won't fix this. 
-mdk, 6/21/91] Log-Number: 31140 Date: Wed, 5 Jun 91 17:01:32 PDT From: dlong (Dean Long) Subject: Net_ArpPacket in netInet.h The field targetProtAddr in the structure Net_ArpPacket will be aligned to a 4-byte address on a sun4, which will break things. Perhaps targetProtAddr could be changed to char[4] instead of unsigned int. dl Log-Number: 31151 From: mendel (Mendel Rosenblum) Subject: Allspice memory usage and crashes during dumps Date: Mon, 10 Jun 91 12:31:57 PDT Here is jhh's and my guess of why allspice runs out of memory during the dumps. Allspice's file cache contains a maximum of 29778 4k blocks. Let's assume that the dump hits a run of files of size <= 4K. It can bring as many as 29778 into the file cache so it will need 29778 Fsio_FileIOHandles of size 264 bytes each. The Fsio_FileIOHandle also points to the descriptor of the file (128 bytes each) and a name (about 32 bytes). All these sizes are binned in the memory allocator (rounded up to 336, 136, and 32 bytes) so they occupy 29778 * (336 + 136 + 32) = 14.3 megabytes. Because these are binned the space is kept by the memory allocator for future reuse when the objects are freed. Next assume that the dump hits a run of remote files of size <= 4K. It can bring as many as 29778 remote files into the cache, each of which needs an Fsrmt_FileIOHandle structure. Fsrmt_FileIOHandles are binned at size 280 with the complete pathname also malloc'ed (avg 40 bytes). 29778 * (280 + 40) = 9.1 megabytes. So after dumping a local and remote disk the memory allocator has 9.1 + 14.3 = 23 megabytes in use for binned handles. The current kernel memory limit is 32 megabytes for everything. Mendel Log-Number: 31152 Date: Mon, 10 Jun 91 13:12:10 PDT From: ouster (John Ousterhout) Subject: Re: Allspice memory usage and crashes during dumps Can we just increase the Allspice memory limit a bit, say to 40 Mbytes? 
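[A rough check of the estimate in log 31151, sketched in C. The figures are the ones quoted in that message (29778 cache blocks; binned sizes 336, 136, and 32 bytes per local handle; 280 bytes plus an average 40-byte pathname per remote handle). The macro and function names are hypothetical, and "megabytes" is taken to mean 2^20 bytes, which is an assumption.]

```c
#include <assert.h>

/* Figures quoted in log 31151; the names are invented for this sketch. */
#define CACHE_BLOCKS      29778L  /* maximum 4K blocks in the file cache */
#define LOCAL_HANDLE_BIN    336L  /* Fsio_FileIOHandle after binning */
#define DESCRIPTOR_BIN      136L  /* file descriptor after binning */
#define NAME_BIN             32L  /* file name */
#define REMOTE_HANDLE_BIN   280L  /* Fsrmt_FileIOHandle after binning */
#define REMOTE_PATH_AVG      40L  /* average malloc'ed pathname */

/* Worst case: every cache block holds a distinct small local file. */
static long LocalHandleBytes(void)
{
    return CACHE_BLOCKS * (LOCAL_HANDLE_BIN + DESCRIPTOR_BIN + NAME_BIN);
}

/* Worst case: every cache block holds a distinct small remote file. */
static long RemoteHandleBytes(void)
{
    return CACHE_BLOCKS * (REMOTE_HANDLE_BIN + REMOTE_PATH_AVG);
}

/* Convert a byte count to megabytes, assuming 1 MB = 2^20 bytes. */
static double ToMegabytes(long bytes)
{
    assert(bytes >= 0);
    return (double)bytes / (1024.0 * 1024.0);
}
```

[Under these assumptions the two pools come to roughly 14.3 and 9.1 megabytes, about 23 megabytes together, which is the total the message arrives at.]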
-John- Log-Number: 31163 From: mendel (Mendel Rosenblum) Subject: Re: "got a debugger packet from" lacks address Date: Thu, 13 Jun 91 14:05:41 PDT > Return-Path: kupfer > Received: by sprite.Berkeley.EDU (5.59/1.29) > id AA663877; Thu, 13 Jun 91 12:22:39 PDT > Message-Id: <9106131922.AA663877@sprite.Berkeley.EDU> > To: bugs > Subject: "got a debugger packet from" lacks address > Date: Thu, 13 Jun 91 12:22:38 PDT > From: Mike Kupfer <kupfer> > > I noticed that when Ed Lee did a "kmsg -d" to put raid1 into the > debugger, the following message appeared on raid1's console: > > *** Got a debugger packet from *** > > Is there supposed to be an ethernet address in that message or > something? > > mike The problem is due to a misunderstanding between the kmsg program and the Sprite kernel on the format of the debug packet. The kmsg program sends the packet in native machine format while the kernel assumes the packet is in sun3 format. The kmsg program defines a debug packet to be: struct { Net_EtherHdr etherHdr; int nameLen; int name[100]; } while the kernel assumes that a packet looks like a Net_EtherHdr followed by a nameLen, followed by the name. Because the Net_EtherHdr is 14 bytes, a structure defined as above will have 2 bytes of padding between the etherHdr and nameLen fields. This padding occurs on sparc and mips, but not on 68K's. So only the sun3 can send properly formatted Sprite debug packets. If this problem is fixed we will hit a second problem in that nameLen is assumed by the kernel to be in big endian byte order while the ds3100 kmsg sends it in little endian byte order. In terms of doing the "right thing", a DBG_CALL is done by the net module regardless of the nameLen or name so everything works except the printf message. I propose that we hold a special sprite meeting to discuss the possible solutions to this problem. 
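[One possible direction, sketched under stated assumptions rather than taken from kmsg or the kernel: have the sender write the packet byte by byte, so that neither struct padding nor host byte order reaches the wire. EtherHdrSketch, NativeDebugPacket, and PackDebugPacket are hypothetical names; the header is treated as 14 opaque bytes, the name is treated as a character string, and nameLen is assumed to be 4 bytes, big endian, immediately after the header.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETHER_HDR_LEN 14    /* header size given in the message */
#define NAME_MAX_LEN  100

/* Stand-in for Net_EtherHdr: 14 opaque bytes. */
typedef struct {
    unsigned char bytes[ETHER_HDR_LEN];
} EtherHdrSketch;

/*
 * A natively laid-out packet, for comparison only. On machines that
 * align a 4-byte int to 4 bytes, offsetof(NativeDebugPacket, nameLen)
 * is 16 rather than 14, which is the padding the message describes.
 */
typedef struct {
    EtherHdrSketch etherHdr;
    int32_t nameLen;
    char name[NAME_MAX_LEN];
} NativeDebugPacket;

/*
 * Serialize header, length, and name with no padding and with the
 * length in big-endian order. Returns the number of bytes written,
 * or 0 if the name is too long or the buffer is too small.
 */
static size_t PackDebugPacket(unsigned char *buf, size_t bufLen,
                              const EtherHdrSketch *hdr, const char *name)
{
    size_t nameLen;
    size_t total;

    assert(buf != NULL && hdr != NULL && name != NULL);
    nameLen = strlen(name);
    total = ETHER_HDR_LEN + 4 + nameLen;
    if (nameLen > NAME_MAX_LEN || total > bufLen) {
        return 0;
    }
    memcpy(buf, hdr->bytes, ETHER_HDR_LEN);
    buf[ETHER_HDR_LEN + 0] = (unsigned char)((nameLen >> 24) & 0xff);
    buf[ETHER_HDR_LEN + 1] = (unsigned char)((nameLen >> 16) & 0xff);
    buf[ETHER_HDR_LEN + 2] = (unsigned char)((nameLen >> 8) & 0xff);
    buf[ETHER_HDR_LEN + 3] = (unsigned char)(nameLen & 0xff);
    memcpy(buf + ETHER_HDR_LEN + 4, name, nameLen);
    return total;
}
```

[Explicit serialization handles both the padding and the byte-order problem in one place. Whether the sender or the kernel should change is exactly the open question raised above.]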
Mendel Log-Number: 31164 Subject: couldn't boot allspice off ginger Date: Thu, 13 Jun 91 22:43:38 PDT From: Mike Kupfer <kupfer> Well, my attempt to reboot allspice with my kernel was a dismal failure. It took 15-20 minutes just to download the kernel, and of course the &*^(%&^$*&^% root partition wouldn't fscheck cleanly, so it had to reboot a second time. By this time it was 19:15 or 19:30, and Mendel and Mary had joined the party. We eventually gave up and booted "new" off the disk. (begin bitching) Why in Hell do we still have this problem where we can't shut down a system cleanly? And why in Hell is the root partition so big? The bigger it is, the more likely it will fail the disk check, hence the more likely we'll have to reboot. (end bitching) I went through the 1990 bug list and found a couple instances of similar problems, where downloading a kernel would go very slowly. One suggestion from the bug list is that the clients are somehow polluting the net trying to talk to allspice. I tried ftp'ing a copy of the kernel from ginger to shallot and it went at normal speed, so if there is some global problem, why would it only affect the downloading of the kernel? I am suspicious that it might be a problem with the network driver in allspice's PROM. Etherfind shows a pattern very similar to that exhibited when murder takes forever to boot: the client asks for a packet, and the server immediately replies. Then there is a two-second pause before the client sends anything else to the server. Unfortunately, etherfind doesn't conveniently give sequence numbers for tftp connections, so I don't know for sure whether the second request is for the block that the server just sent it. Anyway, the reason that we're going through all this fuss and bother in the first place is because of the problems we had last year where we couldn't boot anything except "new" off the disk. Has this been fixed? 
mike Log-Number: 31165 From: mendel (Mendel Rosenblum) Subject: Re: couldn't boot allspice off ginger Date: Fri, 14 Jun 91 09:00:42 PDT > > (begin bitching) > > Why in Hell do we still have this problem where we can't shut down a > system cleanly? I suspect the problem is that we sync the disk and then kill off all processes. Killing processes can cause files to be deleted from /swap/14 (allspice's swap area) which is on /. This explains the directory /swap/14/{number} pointing to an unallocated inode. > > And why in Hell is the root partition so big? The bigger it is, the > more likely it will fail the disk check, hence the more likely we'll > have to reboot. Too bad you weren't around when I tried to argue against this and lost. > > (end bitching) > > I went through the 1990 bug list and found a couple instances of > similar problems, where downloading a kernel would go very slowly. > One suggestion from the bug list is that the clients are somehow > polluting the net trying to talk to allspice. I tried ftp'ing a copy > of the kernel from ginger to shallot and it went at normal speed, so > if there is some global problem, why would it only affect the > downloading of the kernel? Did you use ftp or tftp? The PROM and netBoot use tftp to download the kernel. > > I am suspicious that it might be a problem with the network driver in > allspice's PROM. Etherfind shows a pattern very similar to that > exhibited when murder takes forever to boot: the client asks for a > packet, and the server immediately replies. Then there is a > two-second pause before the client sends anything else to the server. > Unfortunately, etherfind doesn't conveniently give sequence numbers > for tftp connections, so I don't know for sure whether the second > request is for the block that the server just sent it. It seems more likely to be a problem in netBoot. The PROM network driver is only called to send and receive packets and allspice and murder have different network interface hardware. 
The piece in common here is the netBoot program. > > Anyway, the reason that we're going through all this fuss and bother > in the first place is because of the problems we had last year where > we couldn't boot anything except "new" off the disk. Has this been > fixed? I don't remember anyone fixing this. Hook up a disk to anise and you have a good test setup for looking at this problem. Mendel Log-Number: 31172 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Sun, 16 Jun 1991 20:00:05 PDT Subject: Re: couldn't boot allspice off ginger > > Why in Hell do we still have this problem where we can't shut down a > system cleanly? Because no one has the time to fix it. Same is true for any Sprite bugs that have a known fix, but still exist. > > And why in Hell is the root partition so big? The bigger it is, the > more likely it will fail the disk check, hence the more likely we'll > have to reboot. > There is a trade-off between having a big and small root. A big root makes it easier to configure the system, ie its easier to make sure you have all binaries, data files, devices, etc. available during the boot. The root is not rechecked, so the only delay is due to the kernel download. Mary's work on fast reboot will eventually eliminate this delay. > I am suspicious that it might be a problem with the network driver in > allspice's PROM. Etherfind shows a pattern very similar to that > exhibited when murder takes forever to boot: the client asks for a > packet, and the server immediately replies. Then there is a > two-second pause before the client sends anything else to the server. > Unfortunately, etherfind doesn't conveniently give sequence numbers > for tftp connections, so I don't know for sure whether the second > request is for the block that the server just sent it. It wouldn't surprise me if there is a problem with the prom. Or netBoot. 
> > Anyway, the reason that we're going through all this fuss and bother > in the first place is because of the problems we had last year where > we couldn't boot anything except "new" off the disk. Has this been > fixed? See my first response. It is no secret that Sprite has a lot of bugs. It is also no secret that we only have a few people to fix them. These same people are supposed to be making progress towards getting out of here. We have to strike a balance, and it probably means that not all bugs will be fixed. I think all of the problems you mentioned were previously discussed in a Sprite meeting, and we decided not to take action on them. John Log-Number: 31167 Date: Fri, 14 Jun 91 17:13:34 PDT From: ouster (John Ousterhout) Subject: man "NAMES " bug fixed. I've fixed the bug Mike Kupfer reported in log-number 31123, about man barfing unnecessarily when the NAMES section header in a manual entry has an extra trailing space. -John- Log-Number: 31168 Date: Fri, 14 Jun 91 18:28:03 PDT From: shirriff (Ken Shirriff) Subject: Mysterious exabyte errors When I do the weekly dumps, I often get the following errors: |Warning: Exabyte 8200 at SCSI3#2#2 Target 5 LUN 0 error: hardware error - info bytes 0x0 0x0 0x1 0x52 |Warning: Exabyte Data Flow Underrun |Warning: Exabyte Formatter error, catastrophic failure! |Warning: Exabyte 8200 at SCSI3#2#2 Target 5 LUN 0 error: media error - info bytes 0x0 0x0 0x0 0xaa |Exabyte File Mark Error or |Warning: Exabyte 8200 at SCSI3#2#2 Target 5 LUN 0 error: media error - info bytes 0x0 0x0 0x0 0xa3 |Warning: Exabyte maximum write retries attempted |Warning: Exabyte 8200 at SCSI3#2#2 Target 5 LUN 0 error: media error - info bytes 0x0 0x0 0x0 0xa2 |Exabyte File Mark Error |Warning: Exabyte 8200 at SCSI3#2#2 Target 5 LUN 0 error: hardware error - info bytes 0x0 0x0 0x1 0x52 |Warning: Exabyte Data Flow Underrun |Warning: Exabyte Formatter error, catastrophic failure! 
|Warning: Exabyte 8200 at SCSI3#2#2 Target 5 LUN 0 error: media error - info bytes 0x0 0x0 0x0 0xaa |Exabyte File Mark Error Then the tar fails with "can't write to -: I/O error" Any idea what this means? If I try restarting the dumps on the same tape, they work the second time. Ken Log-Number: 31171 Date: Sun, 16 Jun 91 14:56:47 PDT From: shirriff (Ken Shirriff) Subject: exb1 doesn't work I'm still getting media errors on the exabyte drive attached to allspice after running the cleaning tape through the drive. As there are now no working drives, I am unable to complete the dumps. Ken Log-Number: 31173 Date: Thu, 20 Jun 91 10:04:39 PDT From: ouster (John Ousterhout) Subject: Allspice running out of memory I just had an idea about the problem we've been seeing where Allspice runs out of memory because of high-water marks with local and remote handles. As I remember, the problem is that the dumper reads a large number of remote files, building up a huge list of remote handles, then it reads a large number of local files, building up a huge list of local handles. Since local and remote handles are different sizes, two different and non-interchangeable pools of memory get allocated, thereby wasting memory. How about solving this problem by always using a single size for all handles, local or remote? There's not much difference in size anyway, is there? For example, FS could define a union that contains both local and remote handles and use the size of the union when allocating space for either handle type. This would make it possible for remote handle space to be reallocated for local handles, and vice versa. -John- Log-Number: 31174 Date: Thu, 20 Jun 91 10:50:19 PDT From: mendel (Mendel Rosenblum) Subject: Re: Allspice running out of memory >How about solving this problem by always using a single size for all >handles, local or remote? There's not much difference in size anyway, >is there? 
For example, FS could define a union that contains both >local and remote handles and use the size of the union when allocating >space for either handle type. This would make it possible for remote >handle space to be reallocated for local handles, and vice versa. The suggested fix would solve the problem on allspice. The memory allocator binning is defeating the code that reclaims space from file handles. Using a single size would fix this. The disadvantage of this patch is that it will increase the kernel memory usage on the client machines. The client machines have zero local handles but with this patch all remote handles will occupy 336 rather than 264 bytes. This will increase the size of the handle table on the clients by 27%. Since clients have around a max of 1000 handles this will mean client kernels will take 72K more memory. At $100/MB this costs $7 more. There are around 30 sprite machines so it is a 30 * $7 = $210 problem. Assume that an RA is paid around $9 an hour. This means that it would make economic sense to fix this problem if it would take less than $210/$9 = 23 hours. This time limit changes if a faculty member does the fix. The faculty member would charge his or her consulting rate of $250 an hour. In this case, fixing the problem would be worthwhile if it took less than 50 minutes. Since I spent 5 minutes writing this email, we have already spent (5/60) * $9 = 75 cents on this problem. Mendel Log-Number: 31177 Subject: sun4 longjmp clobbers local variable Date: Sat, 22 Jun 91 23:28:40 PDT From: Mike Kupfer <kupfer> The enclosed program demonstrates a bug with gcc 1.37.1 on the sun4. The longjmp (triggered by the 'l' input) restores "firstTime" to an old value. (Before anyone says "but it's allowed to work that way", please note that the setjmp happens every time through the loop.) This problem doesn't appear on murder (even though gcc 1.37.1 is also installed on sun3's) or okeeffe (a CCI Tahoe with gcc 1.39). 
It is reproducible on shallot (a sun4 with gcc 1.37.1.R). You have to use -O to see the bug. I don't have access to any sun4's running a more recent gcc, so I don't know if this is a known gcc bug. I'll submit a bug report to gnu.gcc.bug. mike --
/*
 * Test program for automatic variables & setjmp. To show the bug,
 * run it and give it as input 'a', 'b', and then 'l'. After the 'l'
 * you'll get "first time", even though you shouldn't.
 */
#include <stdio.h>
#include <setjmp.h>

jmp_buf env;

main()
{
    int firstTime = 1;
    int ch;

    for (;;) {
        (void)setjmp(env);
        if (firstTime) {
            printf("first time\n");
        }
        printf("? ");
        ch = getchar();
        if (ch == 'l') {
            longjmp(env, 1);
        }
        if (ch == EOF) {
            exit(0);
        } else {
            printf("%c\n", ch);
        }
        firstTime = 0;
    }
}
Log-Number: 31178 Subject: Re: sun4 longjmp clobbers local variable Date: Sun, 23 Jun 91 22:15:57 PDT From: Mike Kupfer <kupfer> Thorsten pointed out that the Software Warehouse has gcc 1.40. I tried it out on shallot and the bug has apparently been fixed. mike Log-Number: 31179 Subject: allspice crash: Fscache_RemoveFileFromDirtyList Date: Mon, 24 Jun 91 13:39:28 PDT From: Mike Kupfer <kupfer> Allspice panicked with the "Fscache_RemoveFileFromDirtyList" bug. Mendel is already looking into this one and said we didn't need to take a core dump, so we just rebooted. mike Log-Number: 31180 Subject: allspice boot problems: phase mismatch Date: Mon, 24 Jun 91 13:43:26 PDT From: Mike Kupfer <kupfer> We were unable to boot the "kupfer" kernel off of allspice's disk. It would start loading the kernel and then the disk access light would go on and stay on. Eventually something like "getbyte error: phase mismatch" and "status = FFFFFFFF" would appear, followed by repeating messages that the SCSI bus was hung. Resetting allspice didn't help, so we booted the "new" kernel.
mike Log-Number: 31182 From: mendel (Mendel Rosenblum) Subject: Allspice rebooted Tuesday morning Date: Wed, 26 Jun 91 09:15:50 PDT When I came in Tuesday morning allspice was down. The console had Phil Loarie on it running ethernet diagnostics from the PROM. From what Phil said, Sprite may or may not have been working at the time he pushed the WatchDog Reset button. Mendel Log-Number: 31184 Date: Thu, 27 Jun 91 14:38:16 PDT From: shirriff (Ken Shirriff) Subject: Allspice crash Allspice crashed with the following: cleaned /swap1 Fs_PageCopy: copy failed 50002 Fatal error: MachHandleTrap: error occured in a user process procptr=f6d28078, pc=f604e4b8 entering debugger at f60c2fe4 Log-Number: 31186 From: mgbaker (Mary Gray Baker) Subject: man strcpy broken Date: Thu, 27 Jun 91 22:33:41 PDT Doing a "man strcpy" gets the following: End-of-file in name line for "send". No manual entry for "strcpy". Attempting to re-install the manual page causes a ranlib on the sun4 C library -- always a heart-warming, confidence-building occurrence. Maybe I'll look at this before I pass out. Maybe not. Mary Log-Number: 31187 From: mendel (Mendel Rosenblum) Subject: Re: man strcpy broken Date: Fri, 28 Jun 91 10:25:13 PDT > Return-Path: mgbaker > Received: by sprite.Berkeley.EDU (5.59/1.29) > id AA275774; Thu, 27 Jun 91 22:33:44 PDT > From: mgbaker (Mary Gray Baker) > Message-Id: <9106280533.AA275774@sprite.Berkeley.EDU> > To: bugs > Subject: man strcpy broken > Date: Thu, 27 Jun 91 22:33:41 PDT > > Doing a "man strcpy" gets the following: > > End-of-file in name line for "send". > No manual entry for "strcpy". > I noticed this a few days ago and tracked it down to a truncated man index (/sprite/lib/man/lib/c/index). I assumed that this occurred because allspice had crashed the night before about the time cron should have run the reindex of the man pages. I didn't touch anything because I thought that cron would fix things up the next night. I guess it didn't. 
I just ran the reindex program by hand and everything worked correctly. Mendel Log-Number: 31188 From: mendel (Mendel Rosenblum) Subject: Kernel built from uninstalled mods doesn't enter debugger Date: Fri, 28 Jun 91 11:04:27 PDT A kernel built from the uninstalled mods doesn't enter the debugger correctly on the sun4c. It's pretty neat what happens. When the panic happens the kernel prints the message and switches to the debugger stack. From there it tries to sync the disk. This causes a context switch to happen so the process trying to enter the debugger gets descheduled. Everything else keeps running. You get a "Fatal Error" message and everything keeps working. You can move the mouse and type in windows. As long as the panic() didn't occur with some important lock held you might never know it happened. We should start charging our users more because we now have a fault-tolerant kernel. No more DDJ. Mendel Log-Number: 31189 From: mgbaker (Mary Gray Baker) Subject: Re: Kernel built from uninstalled mods doesn't enter debugger Date: Fri, 28 Jun 91 12:03:21 PDT Yup. This is a problem. I've just removed the offending code from the sun4 dbg module. However, this is the same code that exists already in all the other machine types. This would seem to explain why decstations sometimes seem to "pop out" of the debugger. It seems clear that we need to invest more effort in this shutdown/panic business if we want it to work. I will remove the offending code from the other machine types too. This means they'll stay in the debugger, but they won't try to sync their disks. Mary Log-Number: 31190 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Fri, 28 Jun 1991 17:11:06 PDT Subject: exabyte errors explained Some of the problems with the exabytes are due to a bug in dump. If you specify the "-r" or "-s" options to reinitialize the tape, dump will write out an invalid dump tape label.
If you try to access the tape, perhaps to restore or perhaps to do another dump, you get: Dump: The tape does not have a correct label: invalid argument I'm busy rewriting dump to work on the exb8500 and I'll fix this bug. John Log-Number: 31191 From: mendel (Mendel Rosenblum) Subject: Problem with SparcStation and sun4 running out of PMEGs Date: Sat, 29 Jun 91 14:54:43 PDT While doing some stress testing of some changes I made to LFS I ran into the following problem with the Sprite kernel memory management on the sparcStation1 and sun4. Remember that the SparcStation MMU has hardware page mapping tables called PMEGs that are used to map virtual addresses into physical addresses. The way the Sprite kernel is coded it must wire down the PMEGs used to map the Sprite kernel. A while back I changed the code not to wire down the PMEGs used to map the kernel's file cache. (This was limiting the size of the file caches on the sun4). On the SparcStation1, there are 128 PMEGs available. Each PMEG maps 64 4K pages (256 Kbytes of mapping). 5 of them are allocated to the PROM so are unavailable for Sprite. This leaves 123 PMEGs or around 30 megabytes of mapping available for Sprite, the file cache, and all user level processes. The size of the Sprite kernel code and static data is around 1428.8 kilobytes which uses 6 PMEGs to map. The malloc()'ed data size of the Sprite kernel is around 5.3 megabytes which requires around 23 PMEGs wired. Next come the kernel stacks. There are 3 megabytes allowed for kernel stacks. These PMEGs only need to be wired if a process has allocated a stack on the PMEG. Since there is no code trying to allocate kernel stacks on the same PMEGs, it tends to allocate stacks on most of the PMEGs. This accounts for around 10 wired PMEGs. Another 5 PMEGs are wired for devices and DMA mapping. Together this accounts for close to 50/128 (40%) of the PMEGs. The rest (78 PMEGs) are available to user programs and the file cache.
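[Editor's sketch, not part of the original message.] The PMEG budget described above can be rechecked with a few lines of C. This is only a sketch under stated assumptions: the macro and function names are made up for illustration and do not come from the Sprite sources, and the wired counts for malloc()'ed data, kernel stacks, and devices are simply the approximate figures quoted in the message, not values computed from sizes. Summing the quoted figures gives 49 unavailable PMEGs, which the message rounds to "close to 50", so the sketch arrives at 79 free PMEGs rather than the rounded 78.

```c
/* Figures quoted in the message above for the SparcStation 1.
 * All names here are hypothetical, chosen for illustration only. */
#define PMEG_TOTAL       128   /* PMEGs provided by the MMU */
#define PMEG_PROM          5   /* PMEGs reserved by the PROM */
#define PAGES_PER_PMEG    64   /* pages mapped by one PMEG */
#define PAGE_KBYTES        4   /* page size in kilobytes */

/* Kilobytes of virtual address space mapped by one PMEG (64 * 4K). */
static int pmeg_kbytes(void)
{
    return PAGES_PER_PMEG * PAGE_KBYTES;
}

/* PMEGs needed to map a contiguous region of `kbytes`, rounding up. */
static int pmegs_needed(int kbytes)
{
    return (kbytes + pmeg_kbytes() - 1) / pmeg_kbytes();
}

/* Kilobytes of mapping left to Sprite once the PROM has its PMEGs. */
static int sprite_mappable_kbytes(void)
{
    return (PMEG_TOTAL - PMEG_PROM) * pmeg_kbytes();
}

/* PMEGs left for user processes and the file cache. */
static int pmegs_available(void)
{
    int code    = pmegs_needed(1429); /* ~1428.8 KB of code + static data */
    int heap    = 23;                 /* quoted: malloc()'ed kernel data */
    int stacks  = 10;                 /* quoted: kernel stacks */
    int devices = 5;                  /* quoted: devices and DMA mapping */

    return PMEG_TOTAL - PMEG_PROM - code - heap - stacks - devices;
}
```

With these inputs, pmegs_needed(1429) gives the 6 PMEGs quoted for kernel code, and sprite_mappable_kbytes() gives 31488 KB, which matches the "around 30 megabytes" in the message.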
The file cache on a SparcStation with 28 megabytes of memory is allocated 21 megabytes of virtual addresses. To totally map this would take 87 PMEGs which is more than we have. Fortunately, the file cache only wires a PMEG when a cache block is being accessed, read, or written. This is typically only a few blocks at a time. The problem occurs during a LFS segment write. Assume a segment size of 512*1024. This means that the write back code may try to write 128 4096-byte blocks at once. If the cache blocks are spread out in the cache, it could cause the entire file cache to become wired. This happened on larceny. The segment being written contained 132 blocks which happened to reside on 78 different PMEGs. Larceny panic'ed because it ran out of PMEGs. Allspice is in slightly better shape because it has 512 pmegs. With a kernel image of 32 megabytes wiring 128 pmegs, it has 380 some PMEGs for the file cache and user processes. Still, if someone were to write 1024 files of size less than 512 bytes, and these files happened to reside on over 380 different PMEGs, and LFS tried to write them all into the same segment, the same sort of thing that crashed larceny would happen on allspice. Mendel Log-Number: 31192 Date: Sun, 30 Jun 91 14:21:27 PDT From: shirriff (Ken Shirriff) Subject: Allspice hung for no apparent reason When I got in today, allspice was inert, with a bunch of messages about "reset recv unit". I did an L1-a and continue. It then did a bunch of disk activity and recovery and came back to life. Ken Log-Number: 31193 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Sun, 30 Jun 1991 21:36:01 PDT Subject: repeated recovery w/ cory clients The machines over in Cory go through recovery with allspice every few minutes. I assume it is because the gateway is loaded and the RPC times out. Perhaps we could make the timeout longer for clients that are using RPC on top of IP?
John Log-Number: 31194 From: mendel (Mendel Rosenblum) Subject: Re: repeated recovery w/ cory clients Date: Mon, 01 Jul 91 11:32:34 PDT > Return-Path: jhh > Received: by sprite.Berkeley.EDU (5.59/1.29) > id AA938806; Sun, 30 Jun 91 21:36:05 PDT > Message-Id: <9107010436.AA938806@sprite.Berkeley.EDU> > From: jhh@sprite.Berkeley.EDU (John H. Hartman) > Date: Sun, 30 Jun 1991 21:36:01 PDT > X-Mailer: Mail User's Shell (7.1.1 5/02/90) > To: bugs > Subject: repeated recovery w/ cory clients > > > The machines over in Cory go through recovery with allspice every few minutes. > I assume it is because the gateway is loaded and the RPC times out. Perhaps > we could make the timeout longer for clients that are using RPC on top > of IP? > > John I suspect this is a bug in the recovery system rather than problems with RPC or the gateway. The /etc/crossmount file on king contains obsolete entries such as /boot/cmds/prefix -a /user2 -s assault /boot/cmds/prefix -a /sprite/src -s allspice This hardwires the serverID for these prefixes to be assault and allspice. The problem is that these file systems don't exist on the specified machines, so anything that tries to touch these prefixes causes RPCs that time out. This invokes the recovery but the recovery system does nothing to stop the timeouts. The easiest patch for this would be to correct the /etc/crossmount file in cory. This appears to have been fixed in the new kernel. Mendel Log-Number: 31195 Date: Mon, 1 Jul 91 15:15:07 PDT From: Dean Long <dlong@cedar.ucsc.edu> Subject: console and serial problems The RawProc's that handle console and serial output don't seem to be working right. For example, "echo 12345 > /dev/ttya" only outputs the first two characters ("12"). I think the problem is that there are places in the code that cause characters to be output only if the buffer has *just* become non-empty. If the buffer is already non-empty and new characters are added, nothing happens.
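[Editor's sketch, not part of the original message.] The hypothesis above, that output is started only on the empty to non-empty transition, can be illustrated with a toy model. This is not the Sprite tty or RawProc code: ToyTty, toy_start_output, toy_put_edge, and toy_put_level are hypothetical names, and the "device not ready" condition is an assumed way for a single start request to be lost. The sketch does not try to reproduce the exact "12" symptom; it only shows why an edge-triggered start can leave characters stranded while a level-triggered one does not.

```c
#include <stddef.h>
#include <string.h>

#define TOY_QUEUE_SIZE 32

/* Toy model of an output queue feeding a device (all names hypothetical). */
typedef struct {
    char   pending[TOY_QUEUE_SIZE]; /* characters waiting to be output */
    size_t numPending;
    char   wire[TOY_QUEUE_SIZE];    /* characters the "device" has output */
    size_t numSent;
    int    deviceReady;             /* nonzero if the device accepts output */
} ToyTty;

/* Drain everything pending, but only if the device will take it. */
static void toy_start_output(ToyTty *t)
{
    if (!t->deviceReady) {
        return; /* this start request is lost; nothing retries it */
    }
    memcpy(t->wire + t->numSent, t->pending, t->numPending);
    t->numSent += t->numPending;
    t->numPending = 0;
}

/* Append one character; returns 0 if the toy buffers are full. */
static int toy_append(ToyTty *t, char c)
{
    if (t->numSent + t->numPending >= TOY_QUEUE_SIZE) {
        return 0;
    }
    t->pending[t->numPending++] = c;
    return 1;
}

/* Edge-triggered: start output only when the queue just became non-empty. */
static void toy_put_edge(ToyTty *t, char c)
{
    int wasEmpty = (t->numPending == 0);

    if (toy_append(t, c) && wasEmpty) {
        toy_start_output(t);
    }
}

/* Level-triggered: try to start output whenever anything is pending. */
static void toy_put_level(ToyTty *t, char c)
{
    if (toy_append(t, c)) {
        toy_start_output(t);
    }
}
```

If the first character arrives while deviceReady is 0, toy_put_edge never starts output again because the queue is never empty when later characters arrive, whereas toy_put_level drains the queue on the next character after the device becomes ready.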
dl Log-Number: 31196 Date: Mon, 1 Jul 91 15:24:30 PDT From: dlong@dogwood.ucsc.edu (Dean Long) Subject: faster console output The following diff to sun4c.d/devConsole.c will speed up console output by writing more than one character per PROM call. Calling the PROM function directly seems to work without disabling/enabling interrupts like Mach_MonPutChar does. dl *** /tmp/,RCSt1524570 Mon Jul 1 15:16:13 1991 --- devConsole.c Mon Jul 1 14:51:12 1991 *************** *** 154,165 **** char *outBuffer; /* Output buffer. */ { register DevZ8530 *zPtr = ptr; /* Information about keyboard device. */ ! int c; if (operation != TD_RAW_OUTPUT_READY) { return 0; } ! while (TRUE) { /* * Note: must call DevTtyOutputChar directly, rather than calling --- 154,166 ---- char *outBuffer; /* Output buffer. */ { register DevZ8530 *zPtr = ptr; /* Information about keyboard device. */ ! char buf[TTY_OUT_BUF_SIZE]; ! int c, i; if (operation != TD_RAW_OUTPUT_READY) { return 0; } ! for (i = 0; i < sizeof buf; ++i) { /* * Note: must call DevTtyOutputChar directly, rather than calling *************** *** 172,180 **** if (c == -1) { break; } ! while (Mach_MonMayPut(c & 0x7f) == -1) { ! /* Empty loop; just try again. */ ! } } return 0; } --- 173,182 ---- if (c == -1) { break; } ! buf[i] = c & 0x7f; ! } ! if (i > 0) { ! (*romVectorPtr->fbWriteStr)(buf, i); } return 0; } Log-Number: 31197 From: mgbaker (Mary Gray Baker) Subject: debugging decstations Date: Mon, 01 Jul 91 18:59:48 PDT When using kdbx, if I put a breakpoint in FsrmtFilePageRead, the machine stops there correctly, but when I try to print out the stack trace, it prints out the first few frames and then hangs. It's printing out "T1;" repeatedly on the console of the machine being debugged. How do I get this stack trace? Mary Log-Number: 31198 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Mon, 1 Jul 1991 22:01:12 PDT Subject: decstation debugging I would guess that you hit a bug in dbx. 
The "TI" messages indicate that the kernel is trying to respond to the debugger, but the debugger isn't listening. Basically, the kernel and the debugger are out of sync. I have a personal version of kmsg that whacks the kernel out of the TI loop, but I haven't installed it. I did that to loiter, and found that the stack was messed up. I'll keep my eye on the debugger situation. John Log-Number: 31200 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Mon, 1 Jul 1991 22:47:39 PDT Subject: more on debugging decstations I think the decstations aren't strict enough about what constitutes a valid address when you're in the debugger. Sometimes when back-tracing the stack the machine will suffer a load miss in Dbg_Main trying to load from the requested address. This is obviously difficult to debug, but I'll look into it. John Log-Number: 31201 Date: Tue, 2 Jul 91 09:13:26 PDT From: bmiller (Bob Miller) Subject: allspice down Allspice died about 8:15 this morning. The messages on the console were: vmMemEnd = 0xf80000f8 - Fatal Error = Vm-RawAlloc: Out of memory Entering debugger with a Interrupt Trap (16) exception at PC 0xf60c342c I tried to do some additional checking, but couldn't get anything to work! Are those instructions taped to allspice's rack still valid? Anyway...I reset allspice and re-booted it. Bob Log-Number: 31202 Date: Wed, 3 Jul 91 08:33:51 PDT From: bmiller (Bob Miller) Subject: allspice down Allspice was down when I came in this morning... Fatal Error: LfsOkToRead read from clean segment Entering debugger with a Interrupt Trap (16) exception at PC 0xf60c394c Log-Number: 31203 From: mgbaker (Mary Gray Baker) Subject: problem rebooting 5000's remotely Date: Wed, 03 Jul 91 16:55:31 PDT Neither "shutdown -R" nor "kmsg -d" & "kmsg -R" seem to reboot a ds5000 successfully. It seems there is no way to reboot a ds5000 remotely. Mary Log-Number: 31206 Date: Thu, 4 Jul 91 00:11:03 PDT From: eklee (Edward K.
Lee) Subject: raid1 panicked in LFS I got a core dump (/sprite/src/kernel/sprite/raid1.1.096.core) in case anyone wants to look at it. Go ahead and delete the core file when you're finished. I'll delete the core file in a week if it's still there. Ed Log-Number: 31207 Date: Fri, 5 Jul 91 08:27:19 PDT From: bmiller (Bob Miller) Subject: allspice down - 8 AM, 7/5 Allspice was down when I came in today... Fatal Error: MachPageFault: page fault in kernel process! pc: 0xf607cbe8, addr: 0xecdbfe40 Error: 0x80 Entering debugger with a Interrupt Trap (16) exception at PC 0xf60c397c Log-Number: 31208 From: mgbaker (Mary Gray Baker) Subject: allspice crashed in LFS mousetrap Date: Fri, 05 Jul 91 12:50:20 PDT Allspice crashed with a panic saying "LfsOkayToRead - clean segment" or something like that. Mendel is now looking at the core file for it. Mary Log-Number: 31209 From: mgbaker (Mary Gray Baker) Subject: allspice died on DMA bus error Date: Fri, 05 Jul 91 19:01:29 PDT The most recent allspice crash was due to a dma bus error on the 3rd HBA. Mary Log-Number: 31210 Date: Mon, 8 Jul 91 11:05:01 PDT From: ouster (John Ousterhout) Subject: 2 Allspice crashes Both Allspice crashes this morning were due to power failures. The first power failure took out the entire campus for about a half hour at about 8:15. The second was a problem with the distribution box in the machine room. -John- Log-Number: 31213 Date: Tue, 9 Jul 91 15:37:56 PDT From: ouster (John Ousterhout) Subject: Mail dead Mail doesn't seem to be getting through to Sprite from the outside world. Can the DDJ restart the mail daemons on Allspice? Thanks. -John- Log-Number: 31214 Date: Tue, 9 Jul 91 17:42:50 PDT From: eklee (Edward K. Lee) Subject: gdb crashes while reading symbol table Try: gdb /users/eklee/src/xatax/ds3100.md/xatax.save. Ed Log-Number: 31216 Date: Thu, 11 Jul 91 03:42:46 PDT From: eklee (Edward K. Lee) Subject: previous gdb bug report I fixed it; you can remove this bug from the list.
It was an obscure initialization bug which appeared whenever alloca did not return a zeroed area of memory. Ed Log-Number: 31215 From: mendel (Mendel Rosenblum) Subject: SOSP trace collection changes break file handle scavenge Date: Wed, 10 Jul 91 21:39:05 PDT While trying to generate timing for the LFS recovery program I discovered that the Sprite kernel runs out of memory if you create lots (~10000) of files. The problem is that the Sprite file servers no longer reclaim the space used by a file handle that a client wrote. Handles are not reclaimed as long as there is a valid lastWriter field in the file handle. The lastWriter field is invalidated by the routine Fsconsist_DeleteLastWriter() which is never called in the current Sprite kernel. Fsconsist_DeleteLastWriter() used to be called from the remote Write RPC stub (Fsrm_Write) when the flags in the RPC specified that this was the last dirty block from the client. As part of the SOSP tracing this code and even the flag that specified the last block were removed. This means that file handles are not scavengeable until the client that writes them crashes. I suspect that this change also accounted for the increase in kernel size that we saw recently. Mendel Log-Number: 31217 Date: Thu, 11 Jul 91 13:36:23 PDT From: ouster (John Ousterhout) Subject: Mail down again? Mail service into allspice from the outside world seems to be down again. Help, DDJ? -John- Log-Number: 31218 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Thu, 11 Jul 1991 14:09:18 PDT Subject: readdir() incompatibility readdir on SunOS (ginger) returns struct dirent, while on Sprite it returns struct direct. These structures are not the same. John Log-Number: 31219 From: mgbaker (Mary Gray Baker) Subject: exb1 stuck busy on allspice Date: Thu, 11 Jul 91 19:06:27 PDT Is there any way, other than rebooting allspice, to deal with the following error?
initializing /dev/exb1.nr opening /dev/exb1.nr as archive file Dump: Can't open `/dev/exb1.nr': text file or pseudo-device busy Init failed on tape /dev/exb1.nr Mary Log-Number: 31220 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Thu, 11 Jul 1991 21:19:30 PDT Subject: Re: exb1 stuck busy on allspice This will happen if the device is already open by another process. At the moment there are two dumps running on allspice. John ps -au | grep dump root f0e50 0.0 0.0 168 64 SUSP 0:00 dump -s -f /dev/exb1.nr root f0e55 0.0 0.0 168 64 RWAIT 0:00 dump -s -f /dev/exb1.nr Log-Number: 31221 From: mgbaker (Mary Gray Baker) Subject: allspice mystery death Date: Fri, 12 Jul 91 23:22:28 PDT A little before 11pm allspice stopped. It didn't get into the debugger correctly, and so I was unable to get a core image. It printed nothing of interest on its console, especially not a message as to why it was going into the debugger. Mary Log-Number: 31222 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Mon, 15 Jul 1991 11:46:48 PDT Subject: ds5000 wedged When I came in this morning loiter was a zombie. It wouldn't respond to rpcs, the keyboard, or kmsg -d. I reset it, and found that it was inside of MachKernelExceptionHandler. Perhaps it was in some sort of infinite loop handling an exception. Let's keep our eyes open for similar behavior. John Log-Number: 31223 Date: Mon, 15 Jul 91 15:37:14 PDT From: pmchen (Peter M. Chen) Subject: mail daemon I think when you restarted mail, it lost the mail that had backlogged. Is this inevitable? (I got backlogged mail last time mail died and was restarted--Ken Shirriff was in on that one). By the way, this is the second time today that mail has died. Is there a watchdog program to restart it (like the IPServer)? This didn't use to happen to mail.
Pete Log-Number: 31224 Date: Mon, 15 Jul 91 17:06:22 PDT From: shirriff (Ken Shirriff) Subject: Sendmail problems Pete Chen has been frequently encountering problems with the sendmail daemon getting stuck. We should try to fix this before our users get too upset. Log-Number: 31227 Date: Tue, 16 Jul 91 00:00:52 PDT From: shirriff@ginger.Berkeley.EDU (Ken Shirriff) Subject: Allspice wedged then died Everything seemed to get wedged up behind piracy tonight. I was poking around allspice trying to figure it out, and then allspice died with a Mem_Free: storage block already freed error. So I took a core dump and rebooted. Ken Log-Number: 31228 Subject: does file locking work with nfsmount? Date: Tue, 16 Jul 91 07:32:57 PDT From: Mike Kupfer <kupfer> I was reading news on shallot (the Sprite rn needs to be rebuilt using a bigger newsrc size), and I tried to copy some articles into my MH inbox using inc -file ginger/News/articles ("ginger" is a symbolic link to /home/ginger/sprite/users/kupfer). After a long pause I got unable to lock and fopen /user6/kupfer/ginger/News/articles I assume this is some sort of nfsmount problem. It's probably not worth spending a great deal of time on, though. mike Log-Number: 31229 From: mendel (Mendel Rosenblum) Subject: Re: does file locking work with nfsmount? Date: Tue, 16 Jul 91 11:19:20 PDT > > I was reading news on shallot (the Sprite rn needs to be rebuilt using > a bigger newsrc size), and I tried to copy some articles into my > MH inbox using > > inc -file ginger/News/articles > > ("ginger" is a symbolic link to /home/ginger/sprite/users/kupfer). > After a long pause I got > > unable to lock and fopen /user6/kupfer/ginger/News/articles > > I assume this is some sort of nfsmount problem. It's probably not > worth spending a great deal of time on, though. > > mike The nfsmount daemon doesn't implement the IOC_LOCK Fs_IoControl so it returns the error code GEN_NOT_IMPLEMENTED which equals 4. 
Since nfsmount uses the pdev library, which treats error codes as Unix errnos, it assumes the error code was really EINTR == 4 (Interrupted system call). This gets mapped to GEN_ABORTED_BY_SIGNAL before it is returned to the kernel. This could be a problem because GEN_ABORTED_BY_SIGNAL errors are normally retried by the stubs in libc on the assumption that they were caused by migration. Fortunately, someone already discovered this problem and special-cased the Fs_IOControl stub not to retry IOC_LOCKs that are aborted by signals. Had you tried to unlock the file your process would have gone into an infinite loop doing RPCs to assault until one of the RPCs suffered a timeout. A simple patch would be to make nfsmount return SUCCESS for IOC_LOCK and IOC_UNLOCK so programs will not abort. Implementing the locking would be more work. Mendel ps The long pause was probably due to the inc program which retries the open/flock calls 5 times with 5 second sleep between them. Log-Number: 31233 From: mendel (Mendel Rosenblum) Subject: Patch for sendmail hangup Date: Tue, 16 Jul 91 16:00:26 PDT I put a patch in sendmail that reopens the socket when it gets an error on the accept() call. This appears to get around the problem with sendmail hanging up. Mendel Log-Number: 31235 Date: Wed, 17 Jul 91 16:12:31 PDT From: margo (Margo Seltzer) Subject: ds3100 crashed creating lfs directory Kvetching's disk (formerly babylon's) was just repartitioned into an old fs (~200M) and an lfs (~100M). The first time I tried creating a directory on the lfs, the kernel crashed with a Reserved Instruction at 0x80092c58. - M Log-Number: 31236 From: mendel (Mendel Rosenblum) Subject: Fixed bug in connect() Date: Wed, 17 Jul 91 16:34:12 PDT I fixed a bug in the connect() library routine that was causing sendmail to have problems sending mail and sometimes go into the debugger. The problem was that the connect() routine was passing the global "constant" time_OneMinute to Fs_Select().
If the connection timed out, Fs_Select() would set time_OneMinute to zero. This caused all future connect() requests to time out. Mendel Log-Number: 31237 Subject: repl dies on long "to" list? Date: Fri, 19 Jul 91 17:09:15 PDT From: Mike Kupfer <kupfer> When I try to reply to Terry's message (below), repl (the MH reply program) gets a segmentation fault. After building an executable that has symbols, I put it into the debugger. The problem seems to come from a realloc() of the buffer that contains the "to" list. When I relinked with -lc_g, repl didn't die, but the "cc" list (constructed from the "to" list) was truncated and slightly munged. It's not immediately obvious to me whether this is a bug in repl or in malloc(). mike -- Return-Path: theresa@shallot.Berkeley.EDU Received: from shallot.Berkeley.EDU by sprite.Berkeley.EDU (5.59/1.29) id AA331343; Thu, 18 Jul 91 16:37:24 PDT Received: by shallot.Berkeley.EDU (4.1/1.42) id AA05115; Thu, 18 Jul 91 16:37:17 PDT Date: Thu, 18 Jul 91 16:37:17 PDT >From: theresa@shallot.Berkeley.EDU (Theresa Lessard-Smith) Message-Id: <9107182337.AA05115@shallot.Berkeley.EDU> To: ani@ucbarpa.berkeley.edu, ann@guitar.Berkeley.EDU, bmiller@sprite.Berkeley.EDU, chiueh@ginger.Berkeley.EDU, ss@joyride.Berkeley.EDU, pfile@villandry.Berkeley.EDU, pfile@cory.Berkeley.EDU, msilva@sprite.Berkeley.EDU, delnaz@miro.Berkeley.EDU, delnaz@ucbarpa.berkeley.edu, joel@sprite.Berkeley.EDU, seth@miro.Berkeley.EDU, glenn@ucbarpa.berkeley.edu, tzeng@ucbarpa.berkeley.edu, ksmith@miro.Berkeley.EDU, chinrung@miro.Berkeley.EDU, schauser@boing.Berkeley.EDU, tve@sprite.Berkeley.EDU, spriters@sprite.Berkeley.EDU, rquiros@chism.Berkeley.EDU, rquiros@king.Berkeley.EDU, mani@cory.Berkeley.EDU, vahdat@cory.Berkeley.EDU, funk@miro.Berkeley.EDU, ajay@miro.Berkeley.EDU, rice@miro.Berkeley.EDU, kyi@cory.Berkeley.EDU, claire@postgres.Berkeley.EDU, crystal@ucbarpa.berkeley.edu, glen@ra.Berkeley.EDU, jean@ucbarpa.berkeley.edu, liza@ucbarpa.berkeley.edu,
madfitz@ucbarpa.berkeley.edu, raid@ginger.Berkeley.EDU, sarahb@harmony.Berkeley.EDU, sharon@ucbarpa.berkeley.edu, teresa@ucbarpa.berkeley.edu, theresa@shallot.Berkeley.EDU, tonys@guitar.Berkeley.EDU, xprs@ginger.Berkeley.EDU, raid@sprite.Berkeley.EDU, cori@ginger.Berkeley.EDU, gibson@ginger.Berkeley.EDU, ho@ginger.Berkeley.EDU, luigi@ginger.Berkeley.EDU, schauser@boing.Berkeley.EDU, bertrand@buzz.Berkeley.EDU, dedood@burble.Berkeley.EDU, sah@sprite.Berkeley.EDU, decman@boing.Berkeley.EDU, flaster@boing.Berkeley.EDU, tve@sprite.Berkeley.EDU, moreton@miro.Berkeley.EDU, funk@miro.Berkeley.EDU, clay@miro.Berkeley.EDU, shirman@miro.Berkeley.EDU, chandra@ucbarpa.berkeley.edu Subject: new CS bldg. furniture I would like to have your opinion on one area of furniture for offices in the new CS building -- that item is: your chair. If you had a preference, would you want a desk chair that had arms or no arms. For the kind of work and sitting that you do, which is your preference? Please let me know within the next few days. Thanks for your help. Terry Log-Number: 31239 From: mgbaker (Mary Gray Baker) Subject: allspice crash in fscache module Date: Sun, 21 Jul 91 15:17:23 PDT Allspice crashed today with the following message: Fatal Error: Fscache_RemoveFileFromDirtyList blocks in cache. The previous message on the console indicated that it had just done an Lfs checkpoint, but I don't know how long before or whether this is related. This message said DirtyBlocks (2) after a checkpoint The corefile for debugging is called vmcore.removefiledirtylist. I'll try to get around to debugging it, but I've gotta finish a few things first, so anyone else is welcome to have a shot at it. Mary Log-Number: 31240 Date: Mon, 22 Jul 91 10:53:03 PDT From: ouster (John Ousterhout) Subject: Core leak in execvp In tracking down what seemed to be core leaks in Tcl today I found that the real problem is in execvp. It used malloc to allocate a couple of buffers. 
After a vfork, the child and parent share heap, so these mallocs consume space in the parent which isn't freed after the child execs (and execvp can't free the space before it execs). I fixed the problem by switching to fixed-size buffers for a couple of things. This can cause exec's to fail when they would succeed otherwise (e.g. if the name of a file to exec is longer than 1000 chars or a shell script executed by default (i.e. without #! notation) has more than 1000 arguments), but I don't know any other way to solve the problem. I've modified and tested the code, and I installed the "exec" subdirectory of libc, but I didn't reinstall the whole C library. -John- Log-Number: 31242 From: mendel (Mendel Rosenblum) Subject: More on gdb inserting ^P over rlogin connections Date: Tue, 23 Jul 91 11:21:11 PDT You have probably noticed that using gdb over rlogin connections causes random ^P's to be inserted in the output stream and output is sometimes lost. The problem appears to be a bug in either the ipServer or unix compat stuff with sockets and out-of-band data. Gdb uses the readline library package that sets and resets certain tty driver attributes before and after each line it reads. One of the things it does is to undefine the stop/start characters and reset them to the original state after reading the line. Rlogind uses out-of-band messages to instruct the local rlogin process to change the flow control characters. This use of out-of-band data appears to cause some of the output to be lost. Also, some of the out-of-band data gets inserted into the data stream (^P is the out-of-band command to turn off the flow control characters.) We can hope that these problems will go away when we put the inet code in the kernel. In the mean time here are a couple of possible ways around the problem: 1) Use telnet or tx rather than rlogin to access the remote system. 2) Undefine the start/stop characters before using gdb. (ie.
stty start u stop u) Mendel Log-Number: 31243 Subject: /unix/cmds.ds3100/import.ds3100 is ugly (whining) Date: Wed, 24 Jul 91 15:17:47 PDT From: Mike Kupfer <kupfer> well, it's not ugly by itself, but it makes "df" format each line to be longer than 80 columns, which is ugly. mike Log-Number: 31244 Date: Wed, 24 Jul 91 15:26:03 PDT From: shirriff (Ken Shirriff) Subject: Re: /unix/cmds.ds3100/import.ds3100 is ugly (whining) I've changed /unix/cmds.ds3100/import.ds3100 to /unix/import.ds3100. Now the df is nice. Ken Log-Number: 31245 Date: Thu, 25 Jul 91 11:41:42 PDT From: pmchen (Peter M. Chen) Subject: machparam.h /usr/include/sys/wait.h includes <machparam.h> which doesn't exist. It looks like its new place is /usr/include/machine/machparam.h Should /usr/include/sys/wait.h be updated? Pete Log-Number: 31246 From: mendel (Mendel Rosenblum) Subject: Re: machparam.h Date: Thu, 25 Jul 91 12:12:13 PDT > > /usr/include/sys/wait.h includes <machparam.h> which doesn't exist. It > looks like its new place is /usr/include/machine/machparam.h > > Should /usr/include/sys/wait.h be updated? > > Pete This is a DecStation-only problem caused by using the Ultrix compilers on Sprite. The cc command on the DecStation uses the ultrix preprocessor which has a default search path of /usr/include. The rest of the machines use GNU cpp that has a default path of /usr/include and /usr/include/${MACHINE}.md. machparam.h is in /usr/include/${MACHINE}.md. I changed wait.h to include <machine/machparam.h> which should get around this problem. We are going to need this for Unix compat to work correctly. Mendel Log-Number: 31247 Date: Thu, 25 Jul 91 16:01:07 PDT From: pmchen (Peter M. Chen) Subject: screwed up mail My mail file is screwed up. Try "tail /usr/spool/mail/pmchen", and you'll get lots of I noticed this because my xbiff icon lit up, yet there was no new mail. Pete Log-Number: 31248 Date: Thu, 25 Jul 91 16:05:32 PDT From: pmchen (Peter M. Chen) Subject: /tmp is full (with 96MB free)?
I get lots of the following messages: 7/25/91 15:53:45 allspice (14) RmtFile "/tmp/ftp535617" <3,55> Write-back failed: out of disk space<40008> but df /tmp returns Prefix Server KBytes Used Avail % Used /tmp allspice 270336 147119 96183 60% Pete Log-Number: 31249 Date: Thu, 25 Jul 91 16:11:19 PDT From: pmchen (Peter M. Chen) Subject: machparam.h The same situation (as in <sys/wait.h>) also occurs in <sys/scsi.h> and <sys/param.h> Pete Log-Number: 31250 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Thu, 25 Jul 1991 17:11:50 PDT Subject: Re: machparam.h You'll be better off using mkmf to produce Makefiles. It will set upt the -I flags correctly so your stuff will compile. John Log-Number: 31251 Subject: Peter's mail file fixed Date: Thu, 25 Jul 91 17:14:32 PDT From: Mike Kupfer <kupfer> It apparently got a huge number of ASCII DEL's stuck at the very end. I whacked off the last line with vi, which apparently got rid of the DEL's. mike Log-Number: 31252 Date: Thu, 25 Jul 91 18:39:52 PDT From: pmchen (Peter M. Chen) Subject: Re: Peter's mail file fixed Yes, but I happened to lose about 3 messages in the process. The only reason I realized this was because I have a copy of my mail forwarded to another machine. Pete Log-Number: 31253 From: mendel (Mendel Rosenblum) Subject: Re: Peter's mail file fixed Date: Thu, 25 Jul 91 18:45:06 PDT > Return-Path: pmchen > Received: by sprite.Berkeley.EDU (5.59/1.29) > id AA470080; Thu, 25 Jul 91 18:39:52 PDT > Date: Thu, 25 Jul 91 18:39:52 PDT > From: pmchen (Peter M. Chen) > Message-Id: <9107260139.AA470080@sprite.Berkeley.EDU> > To: kupfer > Subject: Re: Peter's mail file fixed > Cc: bugs > > Yes, but I happened to lose about 3 messages in the process. The only > reason I realized this was because I have a copy of my mail forwarded to > another machine. > > Pete Maybe this is related to the messages in allspice's syslog when /tmp filled. 
There were many messages of the form: Fscache_Write: Alloc failed <3,3> "ma069201" DISK FULL <9>Jul 25 17:42:05 syslog: MAIL: Suspected mail fault!!! <9>Jul 25 17:42:05 syslog: MAIL: message: Anyone know what produced these errors. Mendel Log-Number: 31254 Date: Thu, 25 Jul 91 21:23:42 PDT From: shirriff (Ken Shirriff) Subject: Re: Peter's mail file fixed The "Suspected mail fault" message comes from the "mail" program, if it is given a garbaged mail file to deliver. Here is what seems to have happened: The sendmail program was trying to deliver a spooled mail file, so it piped it into "mail". However, the spooled mail file was empty or nulls, so the mail program complained. So apparently the spooled mail file is what got trashed. Ken Log-Number: 31255 From: mendel (Mendel Rosenblum) Subject: Panic message not printed on Allspice's console. Date: Fri, 26 Jul 91 10:14:14 PDT Allspice crashed yesterday during the Sprite meeting with the message: Entering debugger with a Interrupt Trap (16) exception at PC 0x... This message was caused by the DBG_CALL macro in the panic() routine. Unfortunately, the panic message itself was lost. I suspect that this is due to the newtee program being used to capture the syslog. On a panic() call the system goes down before the newtee program has a chance to output the panic() message to the console. Because the newtee program has already read the message from the syslog buffer it is not in the kernel core file either and is sometimes hard to reconstruct the arguments to panic() using the backtrace. Mendel Log-Number: 31256 From: mendel (Mendel Rosenblum) Subject: Problems with /tmp Date: Fri, 26 Jul 91 10:37:01 PDT /tmp filled several times yesterday. The problem was the LFS on /tmp was configured to allow only 65% of the disk to be used. I corrected this problem but it won't take effect until allspice is rebooted. 
Part of the problem is that LFS file systems inform the "df" command
of the real disk capacity utilization and not the fraction of usable
disk space. To avoid further confusion I patched LFS to lie like the
Unix file system does. Now it will say 100% Used, 0 Avail when the
file system can take no more.

Mendel

ps I guess to be really Unix compatible the df command should say
110% Used when the disk is full. I think I'm going to modify LFS to
let users use 120% of the disk; then LFS will allow you to use 10%
more space than Unix.

Log-Number: 31257
From: mendel (Mendel Rosenblum)
Subject: Allspice panic with disk full
Date: Fri, 26 Jul 91 10:56:38 PDT

Allspice crashed yesterday during the Sprite meeting when /tmp ran
out of disk space. The crash had a previously reported error message
of:

Fatal Error: LfsError on: /tmp status 0x1, Can't update descriptor map.

Contrary to the error message, this is not a problem in LFS. The
problem is in fslclLookup.c in the routine CreateFile(). It occurs
when a file or directory can't be created or added to a directory
because the disk is full. If the directory block create or the
component insert fails the code releases the newly created handle,
frees the memory allocated for the file descriptor, and deallocates
the file number. Unfortunately, it leaves the handle inserted in the
handle table pointing at unallocated memory for its descriptor and
possibly with dirty blocks in the cache. LFS panics when it finds this
file because its file number is not allocated. It's going to take more
than a one-line bug fix to back out of the mess left when this
happens.

If you ever see the message:

DISK FULL

followed by

"CreateFile: unwinding"

this problem just happened and the system doesn't have long to live.
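
[Editorial sketch, not the Sprite source.] The ordering problem
described above can be illustrated with a toy handle table. Everything
in the following C fragment is hypothetical (Handle, handleTable,
fileNumberUsed, CreateFileSketch, LookupHandle, and the diskFull flag
standing in for a failed directory insert); the real CreateFile() in
fslclLookup.c is more involved, and the dirty cache blocks mentioned
above are not modeled at all. The only point it makes is that the
unwind path takes the handle out of the table before freeing the
descriptor it points at, so no later lookup can find a handle with a
freed descriptor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_FILES 16            /* hypothetical table size */

typedef struct {
    int   fileNumber;
    char *descriptor;           /* stands in for the in-memory descriptor */
} Handle;

static Handle *handleTable[MAX_FILES];    /* stands in for the handle table */
static int     fileNumberUsed[MAX_FILES]; /* stands in for file number allocation */

/* Return the handle installed for fileNumber, or NULL if there is none. */
Handle *LookupHandle(int fileNumber)
{
    assert(fileNumber >= 0 && fileNumber < MAX_FILES);
    return handleTable[fileNumber];
}

/*
 * Create a file. A nonzero diskFull simulates the directory insert
 * failing because the disk is full. Returns the new handle, or NULL
 * on failure with every piece of state backed out.
 */
Handle *CreateFileSketch(int fileNumber, int diskFull)
{
    Handle *h;

    assert(fileNumber >= 0 && fileNumber < MAX_FILES);
    if (fileNumberUsed[fileNumber]) {
        return NULL;                    /* file number already allocated */
    }
    h = malloc(sizeof *h);
    if (h == NULL) {
        return NULL;
    }
    h->descriptor = calloc(1, 128);
    if (h->descriptor == NULL) {
        free(h);
        return NULL;
    }
    h->fileNumber = fileNumber;
    fileNumberUsed[fileNumber] = 1;
    handleTable[fileNumber] = h;        /* handle is now visible to lookups */

    if (diskFull) {
        /* Unwind in reverse order of construction. */
        handleTable[fileNumber] = NULL; /* 1. make the handle unreachable */
        fileNumberUsed[fileNumber] = 0; /* 2. give back the file number */
        free(h->descriptor);            /* 3. only now free the descriptor */
        free(h);
        return NULL;
    }
    return h;
}
```

In this sketch, a failed create followed by LookupHandle() on the same
file number returns NULL rather than a handle whose descriptor has
been freed, which is the state the message above says the real code
leaves behind.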
Mendel Log-Number: 31258 Date: Fri, 26 Jul 91 14:26:17 PDT From: ouster (John Ousterhout) Subject: Strange mail behavior Several times in the last hour I have received the "You have new mail" blip from the shell, but when I entered the mail program there were no new messages. This makes me *very* nervous. Is anyone else experiencing the same behavior? Could mail somehow be getting lost? -John- Log-Number: 31259 Date: Fri, 26 Jul 91 14:28:46 PDT From: bmiller (Bob Miller) Subject: Re: Strange mail behavior I had this happen to me 3 or 4 times yesterday... Log-Number: 31260 Date: Fri, 26 Jul 91 14:30:53 PDT From: ouster (John Ousterhout) Subject: Re: Strange mail behavior I think I've fixed the problem: seems there were a bunch of NULLs in my mail spool file, and somehow the NULLs convinced the "mail" program to ignore everything after them. I deleted the NULLs and several new messages suddenly appeared. -John- Log-Number: 31261 From: mendel (Mendel Rosenblum) Subject: Re: Strange mail behavior Date: Fri, 26 Jul 91 14:37:41 PDT I suspect that the mail problems are related to /tmp filling up. From allspice's syslog: Fscache_Write: Alloc failed <3,3> "xlisp.trace.vm" DISK FULL Fscache_Write: Alloc failed <3,3> "ma003667" DISK FULL <9>Jul 26 13:52:34 syslog: MAIL: Suspected mail fault!!! <9>Jul 26 13:52:34 syslog: MAIL: message: <54>Jul 26 13:53:40 lpd[10e54]: lw608-1: lost connection Fscache_Write: Alloc failed <3,3> "ma593502" DISK FULL <9>Jul 26 13:54:08 syslog: MAIL: Suspected mail fault!!! <9>Jul 26 13:54:08 syslog: MAIL: message: Fscache_Write: Alloc failed <3,3> "ma921147" DISK FULL <9>Jul 26 13:54:49 syslog: MAIL: Suspected mail fault!!! <9>Jul 26 13:54:49 syslog: MAIL: message: Fscache_Write: Alloc failed <3,3> "ma069187" DISK FULL <9>Jul 26 13:54:49 syslog: MAIL: Suspected mail fault!!! <9>Jul 26 13:54:49 syslog: MAIL: message: I talked with chiueh and hopefully this will not happen again. 
Mendel

Log-Number: 31263
Date: Sat, 27 Jul 91 21:13:50 PDT
From: elm (ethan miller)
Subject: runaway sendmail on assault?

There is a sendmail process (81949) on assault that seems to be out
of control. It has used 1885 seconds of CPU time so far. Anyone want
to look at it, or should it just be killed?

ethan

Log-Number: 31264
Date: Sat, 27 Jul 91 21:28:01 PDT
From: shirriff (Ken Shirriff)
Subject: Re: runaway sendmail on assault?

I tried to debug the sendmail process (81949), but I couldn't get it
into the debugger. Looking at the code, it blocks the Unix SIGQUIT
signal, which is equivalent to the Sprite DEBUG signal. So unless
someone knows a secret way to get processes into the debugger, I
don't think we can debug it.

Ken

Log-Number: 31265
Date: Sun, 28 Jul 91 11:58:09 PDT
From: mendel (Mendel Rosenblum)
Subject: Re: runaway sendmail on assault?

Try sending it a signal that it doesn't catch but that still puts it
in the debugger. Something like SIGILL or SIGBUS or SIGFPE will work.
You will have to use the gdb "handle" command if you want to continue
execution after attaching the process.

Mendel

Log-Number: 31266
From: Fred Douglis <douglis@cs.vu.nl>
Subject: Re: runaway sendmail on assault?
Date: Sun, 28 Jul 91 21:35:09 +0200

>>>>> On Sat, 27 Jul 91 21:28:01 PDT, shirriff@sprite.Berkeley.EDU
>>>>> (Ken Shirriff) said:

Ken> I tried to debug the sendmail process (81949), but I couldn't
Ken> get it into the debugger. Looking at the code, it blocks the
Ken> Unix SIGQUIT signal, which is equivalent to the Sprite DEBUG
Ken> signal. So unless someone knows a secret way to get
Ken> processes into the debugger, I don't think we can debug it.

>>>>> mendel@sprite.Berkeley.EDU (Mendel Rosenblum) adds:

Mendel> Try sending it a signal that it doesn't catch but that still
Mendel> puts it in the debugger. Something like SIGILL or SIGBUS
Mendel> or SIGFPE will work. You will have to use the gdb "handle"
Mendel> command if you want to continue execution after attaching
Mendel> the process.
I think STOP will have the best of both worlds -- as I recall, it's just like the DEBUG signal except that the DEBUG signal is catchable and sets a flag so you know the process is in a special state. But you can debug a suspended process just like a "debuggable" one. Fred Log-Number: 31268 Date: Sun, 28 Jul 91 15:54:31 PDT From: shirriff (Ken Shirriff) Subject: Re: runaway sendmail on assault? I debugged the runaway sendmail. It was in an infinite loop in malloc. Apparently a block of memory got overwritten with 0's. I couldn't figure out where this happened. Ken Log-Number: 31269 Date: Mon, 29 Jul 91 08:46:29 PDT From: margo (Mary Gray Baker) Subject: "Bogus bp-trap" on login Kvetching (running on the "new" kernel) responds to pings, but doesn't allow any logins. Both when rlogging in and logging in from the console, the message "Bogus bp-trap" appears. I put it in the debugger but was unable to get any useful information from it. - M Log-Number: 31270 Date: Mon, 29 Jul 91 08:48:12 PDT From: bmiller (Bob Miller) Subject: printer hung The printer here in our office (lw533) is hung. Can someone check into fixing the problem? Thanks. Bob Log-Number: 31271 Date: Mon, 29 Jul 91 17:00:51 PDT From: shirriff (Ken Shirriff) Subject: L1-B into debugger L1-? doesn't list any function for L1-B, but L1-B throws machines into the debugger, much to my surprise. Apparently L1-B used to be the function for the serial line debugger. It is no longer listed in the L1 functions, but it still works. So should I remove the L1-B function? Ken Log-Number: 31273 Date: Tue, 30 Jul 91 12:37:53 PDT From: shirriff (Ken Shirriff) Subject: coons console messed up Text to the console of coons (a color ds5000) gets totally messed up if the text wraps around the end of the line. (It prints a whole line of the character that wraps around.) Log-Number: 31274 Date: Wed, 31 Jul 91 12:05:29 PDT From: pmchen (Peter M. Chen) Subject: df contacts greed? 
When I do a "df /user4" on mustard (ds5000), it apparently has to wait for greed to RPC timeout before responding. Greed is now listed as being down (doesn't ping). <domain info> 7/31/91 12:03:05 greed (24) RPC timed-out This is a pretty minor annoyance, but why does df need to contact greed? This has been happening for a few days. Pete Log-Number: 31275 Date: Wed, 31 Jul 91 12:09:15 PDT From: shirriff (Ken Shirriff) Subject: Re: df contacts greed? I did "prefix -d /graphics" on mustard, and now df doesn't wait for greed. (/graphics is served by greed.) Apparently df contacts all the machines in the prefix table, even if you just want one file system. Ken Log-Number: 31276 Date: Wed, 31 Jul 91 23:36:57 PDT From: pmchen (Peter M. Chen) Subject: missed mail I got sent two messages (both from rquiros@sprite) that never appeared in my mbox on sprite. The reason I know they were sent is because my mail gets forwarded/copied to ginger. Is this due to / filling up (as per Mary's "messed up mail" message)? Pete Log-Number: 31277 From: mendel (Mendel Rosenblum) Subject: New kernel (1.097) bug with raw disk devices Date: Thu, 01 Aug 91 10:15:01 PDT A sparcStation1 or sparcStation2 has trouble reading a raw disk device when running a new kernel. For example: jaywalk% dd if=/dev/rsd01c of=/dev/null bs=64k count=1 0+1 records in 0+1 records out only transfers 512 bytes and generates the syslog messages: DevRawBlockDevRead: error 0x0 inLength 65536 at offset 0x0 outLength 512 The problem occurs on any read of the raw device larger than 2048 bytes. I haven't tried this on the decStations. Mendel ps I wouldn't boot this kernel on a machine with local file systems. Fscheck might get confused and destory the file systems. Log-Number: 31278 From: mendel (Mendel Rosenblum) Subject: Re: New kernel (1.097) bug with raw disk devices Date: Thu, 01 Aug 91 11:01:10 PDT > I haven't tried this on the decStations. 
>
> Mendel

The problem occurs on ds5000 machines but not sun4 (sun4/200)
machines.

Mendel

Log-Number: 31279
Date: Thu, 1 Aug 91 11:06:42 PDT
From: shirriff (Ken Shirriff)
Subject: Re: New kernel (1.097) bug with raw disk devices

I booted the 1.097 kernel on kvetching yesterday, and it worked fine.
(Kvetching is a ds3100 serving /postdev)

Ken

Log-Number: 31281
Date: Thu, 1 Aug 91 16:37:33 PDT
From: elm (ethan miller)
Subject: RPCs to allspice hang during LFS cleaning

Is there any way to allow allspice to continue normal operation while
doing LFS segment cleaning? It's annoying to have allspice hang for
two or three minutes while it cleans segments. Is this intrinsic to
the segment cleaning operation, or could this be fixed?

ethan

Log-Number: 31285
Date: Sun, 4 Aug 91 13:56:48 PDT
From: mottsmth (Jim Mott-Smith)
Subject: "listen" socket call behavior

The ipServer does not allow more than 1 "listen" call on a socket; it
reports EOPNOTSUPP on subsequent calls. This is inconsistent with
SunOS which allows multiple "listen" calls, which is useful for
changing the backlog parameter.

-- Jim M-S

Log-Number: 31286
Date: Sun, 4 Aug 91 17:25:17 PDT
From: shirriff (Ken Shirriff)
Subject: Compiler bug?

When I run this program, the result is "Large":

main()
{
    int i;
    i = -100;
    if (i<sizeof(int)) {
        printf("Small\n");
    } else {
        printf("Large\n");
    }
}

Is this a compiler bug or does sizeof not work the way I expect?

Ken

Log-Number: 31287
Date: Sun, 4 Aug 91 17:33:53 PDT
From: eklee (Edward K. Lee)
Subject: Re: Compiler bug?

>>From shirriff Sun Aug 4 17:25:40 1991
>>Date: Sun, 4 Aug 91 17:25:17 PDT
>>From: shirriff (Ken Shirriff)
>>To: bugs
>>Subject: Compiler bug?
>>When I run this program, the result is "Large":
>>main()
>>{
>>    int i;
>>    i = -100;
>>    if (i<sizeof(int)) {
>>        printf("Small\n");
>>    } else {
>>        printf("Large\n");
>>    }
>>}
>>Is this a compiler bug or does sizeof not work the way I expect?
>>Ken

In an expression consisting of both int and unsigned, the int is
coerced to an unsigned. (That's what my C book says.) You will get
the desired effect by casting the result of sizeof to an int.

Ed

Log-Number: 31288
Date: Sun, 4 Aug 91 17:39:45 PDT
From: mottsmth (Jim Mott-Smith)
Subject: Re: Compiler bug?

According to K&R Ansi edition p. 198, the signed int is coerced
to an unsigned only if the type long int cannot represent all
unsigned ints. If it can, a signed comparison is done. Yuck.

-- Jim M-S

Log-Number: 31289
From: mendel (Mendel Rosenblum)
Subject: Re: Compiler bug?
Date: Sun, 04 Aug 91 18:06:13 PDT

>
> According to K&R Ansi edition p. 198, the signed int is coerced
> to an unsigned only if the type long int cannot represent all
> unsigned ints. If it can, a signed comparison is done. Yuck.
>
> -- Jim M-S

Also from K&R Ansi edition p. 135:

"Strictly, sizeof produces an unsigned integer value whose type,
size_t, is defined in header <stddef.h>"

In the sprite stddef.h we have:

typedef int size_t;

This is incorrect for ansi C. Note that cc on SunOS and BSD 4.3 have
sizeof() return an integer so the program prints what Ken expected.
Also, if you compile your program with the -traditional flag with gcc
it will treat sizeof() as signed.

Mendel

Log-Number: 31292
Subject: bug in instrumented lock initialization?
Date: Tue, 06 Aug 91 17:22:45 PDT
From: Mike Kupfer <kupfer>

The "InitDynamic" macros in the kernel sync.h don't initialize the
listInfo field (for either semaphores or locks), and a quick scan
with "gid" doesn't show anywhere else where they might get
initialized.

mike

Log-Number: 31295
From: mendel (Mendel Rosenblum)
Subject: Re: decstation compiler bug
Date: Wed, 07 Aug 91 11:30:49 PDT

> Return-Path: eklee
> Received: by sprite.Berkeley.EDU (5.59/1.29)
> id AA535358; Wed, 7 Aug 91 11:11:54 PDT
> Date: Wed, 7 Aug 91 11:11:54 PDT
> From: eklee (Edward K.
Lee)
> Message-Id: <9108071811.AA535358@sprite.Berkeley.EDU>
> To: bugs
> Subject: decstation compiler bug
>
> Some problem having to do with volatile variable declarations.
> I get a similar error when I try to compile it on dill using cc.
> It compiles without warnings or errors on sparcstations.
>
> Ed

The Ultrix compiler that we used for the decStation on Sprite does
not support full ANSI C like gcc does. Volatile variable declarations
are part of ANSI C. The MIPS compiler appears to produce an "illegal
pointer combination" message if you try to assign something declared
volatile to something not declared volatile. Most of your error
messages are due to the variable "fmt_stream" not being volatile
while many of the variables assigned to it are volatile.

Mendel

Log-Number: 31296
Date: Wed, 7 Aug 91 15:36:23 PDT
From: tve (Thorsten von Eicken)
Subject: bogus load average on ds5000s?

On forgery, mayhem, pepper and subversion the load average is >1 but
I can't figure out why (i.e. no processes seem to use the cpu). Am I
missing something or is migd confused? Past experience is that if I
restart migd it shows a reasonably small load average.

TvE

Log-Number: 31301
From: Fred Douglis <douglis@cs.vu.nl>
Subject: Re: bogus load average on ds5000s?
Date: Thu, 08 Aug 91 10:39:14 +0200

I don't think it's just the ds5000's -- that bug with migd load
averages floating up has been around practically forever, and
unfortunately i never had enough time to invest to fix it. i did try.
probably, if anyone looks at it carefully, it'll be some trivial
embarrassing bug....
Fred

Log-Number: 31298
Date: Wed, 7 Aug 91 17:11:06 PDT
From: tve (Thorsten von Eicken)
Subject: LFS problems

[coons catalog] /bin/ls -l ~tve/lib/santillana/mss/pn7y8.1/ID0127*
/users/tve/lib/santillana/mss/pn7y8.1/ID0127_PN8-27.poem not found
-rw-rw-r-- 1 tve 2609 Nov 11 1990 /users/tve/lib/santillana/mss/pn7y8.1/ID0127_PN8-27.troff

Log-Number: 31300
Date: Wed, 7 Aug 91 22:09:18 PDT
From: tve (Thorsten von Eicken)
Subject: can't rlogin into pepper, but telnet works.

Log-Number: 31303
From: mendel (Mendel Rosenblum)
Subject: netroute bug - can't change enet address of host
Date: Fri, 09 Aug 91 11:03:21 PDT

I changed the ethernet address in /etc/spritehosts for treason and
ran netroute and the route in allspice's kernel did not change. The
same thing happened with sabotage. The only way to change the
ethernet address of a route is to reboot allspice.

Mendel

Log-Number: 31304
From: mendel (Mendel Rosenblum)
Subject: Hack in sched mod for register window problem
Date: Fri, 09 Aug 91 13:26:24 PDT

So people can use their sparc2 without fear of random processes
getting killed I added some code to the sched module to flush the
register windows before calling Proc_SetCurrentProc(). Hopefully this
code is temporary and can be removed when the window handlers are
fixed. I've included a description of the problem at the end of this
message.

Mendel

>To: mgbaker
Subject: Re: Something to watch for
In-reply-to: Your message of Sun, 04 Aug 91 22:51:36 -0700.
<9108050551.AA472380@sprite.Berkeley.EDU>
Date: Tue, 06 Aug 91 18:27:25 PDT
>From: mendel

I found the problem that caused the "MachHandleWindowUnderflow:
killing process!" error. In Sched_ContextSwitchInt() the code sets
proc_RunningProcesses[0] (using Proc_SetCurrentProc) before calling
Mach_ContextSwitch(). Mach_ContextSwitch() does many save's to spill
the windows to the stack. If there is a user window for which the
page is nonresident it saves the window into the Mach_State structure
pointed to by proc_RunningProcesses[0].
This saves the window into the wrong Mach_State structure; the one that is being switched to rather than the structure of the old process. The underflow error occurs because the handler finds a bogus fp to restore from when the process is switched back in. This also happens on the sparcStation1. The extra window on the sparc2 makes it much more frequent. This problem explains the random tcsh going into the debugger that ethan reported in March (log message 30757) Mendel Log-Number: 31305 Date: Fri, 9 Aug 91 15:14:58 PDT From: theresa@shallot.Berkeley.EDU (Theresa Lessard-Smith) Subject: printer problems 533 Last week, Bob was having troubles printing on lw533, and I believe the Sprite folks fixed the problem. However, what ever the fix was does not all my psroff commands, which worked previously. Do you know what was changed and why psroff does not work any more? Thanks Terry Log-Number: 31307 Date: Sun, 11 Aug 91 13:49:32 PDT From: ouster (John Ousterhout) Subject: Vfork parent returns too soon? I haven't checked the kernel code to verify this, but I suspect that our implementation of vfork isn't correct. In particular, it appears to me that vfork may be returning in the parent before the child has invoked exit or exec. This was causing problems in Tcl, since the parent then modified data structures that were shared with the child. The problem went away when I switched to use fork instead of vfork. -John- Log-Number: 31308 Date: Sun, 11 Aug 91 23:29:46 PDT From: mottsmth (Jim Mott-Smith) Subject: Select looks at 1 too many fd's Sprite's select() call looks at 1 too many bits in the bit map. The man page says it will look at n bits (0 through n-1), but in fact it looks at n+1 bits (0 through n). -- Jim M-S Log-Number: 31309 Date: Mon, 12 Aug 91 11:52:58 PDT From: shirriff (Ken Shirriff) Subject: ipServer in debugger Allspice's ipServer was in the debugger, but it got restarted just as I was about to gdb it. 
Is there some way I can debug it when it dies, or is this just a
reason for core files instead of debug processes?

Ken

Log-Number: 31310
Date: Mon, 12 Aug 91 17:37:30 -0700
From: sullivan@postgres.Berkeley.EDU (Mark Sullivan)
Subject: recvfrom on dgram socket

Two processes, server and client, are communicating using UDP
messages. The message exchange goes on for an arbitrary amount of
time, then server dies and is restarted. When server restarts, it
takes over the same port addresses of its predecessor. Client sends
another message to server and server fails on the recvfrom() system
call that reads in client's message.

When the recvfrom() fails, errno is set to ECONNREFUSED -- connection
refused. This errno is only valid for connect() system calls; it
makes no sense for recvfrom to fail because of a refused connection.
The program contains no calls to connect(). The messages involved are
all UDP.

I have been able to reproduce this bug only when going from sabotage
to shangri-la (ULTRIX 3.0). The program works fine from sabotage to
kvetching, works fine from ULTRIX to ULTRIX. I'm pretty certain the
bug has occurred when sabotage was either client or server, but I am
certain it occurs when sabotage is the server.

Mark

Log-Number: 31312
Date: Tue, 13 Aug 91 11:40:42 PDT
From: shirriff (Ken Shirriff)
Subject: Allspice crashed with disk full

Last night allspice crashed after the disk filled, with a bunch of:
CreateFile: unwinding errors and then
Fatal Error: Mem_Free: storage block already free.

Log-Number: 31317
From: mendel (Mendel Rosenblum)
Subject: Re: Allspice crashed with disk full
Date: Wed, 14 Aug 91 08:32:58 PDT

> Date: Tue, 13 Aug 91 11:40:42 PDT
> From: shirriff (Ken Shirriff)
> Message-Id: <9108131840.AA797261@sprite.Berkeley.EDU>
> To: bugs
> Subject: Allspice crashed with disk full
>
> Last night allspice crashed after the disk filled, with a bunch of:
> CreateFile: unwinding errors and then
> Fatal Error: Mem_Free: storage block already free.
There appears to be some very dangerous duplicate use of memory going on when a file system fills and file creates are aborted. It appears that two handles have the same memory for their in memory copy of their Fsdm_FileDescriptor (inode). This causes the LFS /user6 to crash the next morning because an inode was written out with block pointers pointing at the blocks of another file. We should change the CreateFile message to a panic(). The way it is currently, it crashes soon or later but also risks corrupting the file system. Mendel Log-Number: 31314 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Tue, 13 Aug 1991 18:30:16 PDT Subject: deleteuser is evil The deleteuser program totally screwed up /etc/passwd. Both /etc/passwd and /etc/master.passwd were wrong. Here's what happened: loiter<jhh 131> deleteuser dc This program will delete the accounts and erase all the files in the home directories. Are you sure you want to do this? (y or n) y Remove dc from the aliases file? (y or n) y /sprite/lib/sendmail/RCS/aliases,v --> /sprite/lib/sendmail/aliases co error: revision 1.182 already locked by tve Warning: unable to remove `dc' from the aliases file. You'll have to edit the aliases file by hand. removing symbolic link: /users/dc removing home directory: /user1/dc Removing dc from /etc/master.passwd. 
Cannot rename /etc/ptmp.dir: no such file or directory
Cannot rename /etc/ptmp.pag: no such file or directory

And here is some of what /etc/passwd contained:

loiter<jhh 138> more /etc/passwd
root:DD7T2qZNqsPlU:0:1:(NULL):(NULL):(NULL)
daemon:*:1:1:(NULL):(NULL):(NULL)
nobody:*:-2:-2:(NULL):(NULL):(NULL)
sys:*:2:2:(NULL):(NULL):(NULL)
bin:*:3:3:(NULL):(NULL):(NULL)
ftp:*:4:5:(NULL):(NULL):(NULL)
guest:*:5:5:(NULL):(NULL):(NULL)
newuser:*:9998:255:(NULL):(NULL):(NULL)
clear:*:100:100:(NULL):(NULL):(NULL)
andrew:*:543:155:(NULL):(NULL):(NULL)
boothe:*:1491:116:(NULL):(NULL):(NULL)

John

Log-Number: 31315
Subject: Re: deleteuser is evil
Date: Tue, 13 Aug 91 23:57:28 PDT
From: Mike Kupfer <kupfer>

Sigh. I thought I tested deleteuser using a real user, but maybe I
somehow botched it. I would look at the mkpasswd invocation and the
getpwent() calls as places where trouble might be happening.

mike

Log-Number: 31318
Date: Wed, 14 Aug 91 10:49:08 PDT
From: sullivan (Mark Sullivan)
Subject: sysV semaphore bug

(a) proc A creates a semaphore. proc B attaches to the same
    semaphore.
(b) proc A tries to acquire the semaphore and blocks.
(c) proc B releases the semaphore, but A never wakes up.

-- checked that the semaphore value was correct. After B releases
   the semaphore, the value is 1. Sprite allows B to reacquire the
   semaphore after releasing it.
-- If A receives a signal after B releases the semaphore, A returns
   from the semop and acquires the semaphore. It does not fail with
   EINTR when interrupted by a signal if the semaphore value is 1.

The following is a program that can be used to reproduce the bug:

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <stdio.h>

/* two arguments: proc_name and key_value */
main(argc,argv)
    int argc;
    char *argv[];
{
    struct sembuf sops;
    int c;
    char *name = argv[1];
    int semid;
    key_t key = atoi(argv[2]);

    /*
     * must create or attach to semaphore before
     * acquiring, releasing, or getting the value
     * of the semaphore.
     *
     * Invalid input is ignored.
     */
    for (;;) {
        char str[132];

        printf(">>>>> ");
        fflush(stdout);
        if (! gets(str))
            exit(0);
        switch (*str) {
        /* create sem with given key */
        case 'c':
            semid = semget (key, 1, IPC_EXCL|IPC_CREAT|0666);
            if (semid < 0) {
                perror("creating semaphore");
            }
            break;
        /* attach to sem with given key */
        case 'g':
            semid = semget(key, 1, 0666);
            if (semid < 0) {
                perror("semget existing semaphore");
            }
            break;
        /* print current sem value (semaphore 0 of the set) */
        case 'v':
            printf("%s: sem value %d\n",name,
                   semctl(semid, 0, GETVAL, NULL));
            break;
        /* acquire semaphore */
        case 'a':
            sops.sem_num = 0;
            sops.sem_op = -1;
            sops.sem_flg = 0;
            printf("%s: acquiring sema (%d,%d)\n",name,key,semid);
            if (semop(semid,&sops,1)<0) {
                perror("semop acquire");
            }
            break;
        /* destroy */
        case 'd':
            if (semctl(semid, 0, IPC_RMID, 0) <0) {
                perror("destroying semaphore");
            }
            break;
        /* release semaphore */
        case 'r':
            sops.sem_num = 0;
            sops.sem_op = 1;
            sops.sem_flg = 0;
            printf("%s: releasing sema (%d,%d)\n",name,key,semid);
            if (semop(semid,&sops,1)<0) {
                perror("semop release");
            }
            break;
        /* ignore all other inputs */
        default:
            continue;
        }
        printf("%s: done semop (%d,%d)\n",name,key,semid);
    }
}

Log-Number: 31320
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Wed, 14 Aug 1991 18:08:22 PDT
Subject: prefix install broken

It is possible to install a totally bogus prefix, i.e. a prefix for a
path that doesn't exist. In my case I installed "t/t4" instead of
"/t/t4". Once done, the prefix cannot be deleted. I don't see why it
should be possible to install a prefix that doesn't have a
corresponding remote link. In any case it should check that the
prefix starts with "/".

John

Log-Number: 31321
Subject: allspice crash: descriptor map foulup
Date: Wed, 14 Aug 91 22:07:26 PDT
From: Mike Kupfer <kupfer>

Allspice died this evening with

Fatal Error: Descriptor map foulup, can't find file 66084 at 127761.

It was running the 1.096 kernel. Apparently the filesystem
(/sprite/src/kernel?) was screwed up enough that John H.
had to bring up allspice without mounting it. The core file is /export1/cores/allspice.descMapFoulup Two questions: (1) How do we get at the core files, now that they're in /export1? Do we need to add /export1 to the list of filesystems that we import from ginger? (2) There are core files in /expor1/cores from early June, and /export1 is starting to fill up. I will delete all the June core files tomorrow unless someone gives me good reason not to. mike Log-Number: 31327 Date: Fri, 16 Aug 91 00:27:29 PDT From: kupfer@ginger.Berkeley.EDU (Mike Kupfer) Subject: allspice crash: descriptor map foulup Allspice died with another Fatal Error: Descriptor map foulup, can't find file 95330 @ 92032 This is with the 1.096 kernel. The core file is in /home/ginger/cores/allspice.descFoulup2. mike Log-Number: 31325 Subject: allspice crash: DMA bus error -> LFS short read Date: Thu, 15 Aug 91 12:35:00 PDT From: Mike Kupfer <kupfer> I was reinitializing a dump tape on allspice when it crashed with Warning: SCSI3#3 DMA bus error Fatal Error: LfsError: on /swap1 status 0x1, LfsReadBytes short read We rebooted w/o taking a core file. Mendel says this has happened to him before. mike Log-Number: 31326 Subject: dump-related documentation out-of-date Date: Thu, 15 Aug 91 12:58:21 PDT From: Mike Kupfer <kupfer> The documentation for doing dumps has not tracked changes to the dump scripts. The documentation involved includes the man pages for dailydump and weeklydump, as well as /sprite/admin/howto/doADump. The changes that aren't documented (or are only partially documented) include: - the use of /sprite/admin/dump/dumpalias as the "dumper" alias, and how/when to change it - new arguments to the dailydump script - how errors are handled when doing dumps - the use of a lock file to disable daily dumps mike Log-Number: 31328 Date: Fri, 16 Aug 91 08:15:13 PDT From: bmiller (Bob Miller) Subject: allspice down this morning Allspice was down when I came in this morning... 
Fatal error: Descriptor map foulup, can't find file 25001 at 92032 Entering debugger with a Interrupt Trap (16) exception at PC 0xf60c397c Log-Number: 31330 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Fri, 16 Aug 1991 11:18:27 PDT Subject: mopd printing error messages Mopd is printing the following messages to allspice's syslog: [Fri Aug 16 11:02:03 1991]: Out of order function: 10 I have no idea what this means. John Log-Number: 31332 Date: Fri, 16 Aug 91 11:50:09 PDT From: shirriff (Ken Shirriff) Subject: Re: mopd printing error messages I've been having some problems with mopd too. I suspect that allspice is responding too slowly and gets out of step with the client. This is a reason not to run mopd on allspice. Ken Log-Number: 31333 Date: Fri, 16 Aug 91 14:22:33 PDT From: elm (ethan miller) Subject: one of my mail messages (corrupted) ------- Start of forwarded message ------- X-VM-Attributes: [nil nil nil nil nil] Status: RO Received: from ncar.ucar.edu by sprite.Berkeley.EDU (5.59/1.29) id AA265784; Thu, 15 Aug 91 10:23:56 PDT Received: from niwot.scd.ucar.edu by ncar.ucar.EDU (5.65/ NCAR Central Post Office 04/10/90) id AA29210; Thu, 15 Aug 91 11:22:17 MDT Received: from elbert.scd.ucar.edu by niwot.scd.ucar.EDU (5.65/ NCAR Mail Server 04/10/90) id AA01015; Thu, 15 Aug 91 11:22:14 MDT Date: Thu, 15 Aug 91 11:22:11 MDT Message-Id: <9108151722.AA00658@elbert.scd.ucar.edu> Received: by elbert.scd.ucar.edu (5.65/ NCAR Mail Client 04/19/90) id AA00658; Thu, 15 Aug 91 11:22:11 MDT Subject: Re: procstat >From: djc@niwot.scd.ucar.EDU (Dennis Colarelli) To: elm@sprite.Berkeley.EDU [Normal mail message deleted] [Weirdness starts here:] Broadcasting for server of "/sprite/src/kernel" Importing "/sprite/src/kernel" from allspice <18>Aug 15 11:41:26 sendmail[12c1e]: AA601111: SYSERR: net timeout: connection timed out during greeting wait with sparc.berkeley.edu <19>Aug 15 11:41:26 sendmail[12c1e]: pattrsn@sparc.Berkeley.EDU... 
reply: read error 8/15/91 11:41:59 catnip (48) rebooted 8/15/91 11:47:58 catnip (48) rebooted Broadcasting for server of "/home/ginger/cores" Importing "/home/ginger/cores" from lust <stat> 8/15/91 12:17:46 allspice (14) RPC timed-out get attr of "/usr/spool/mail/pmchen" waiting for recovery <close> 8/15/91 12:18:01 allspice (14) RPC timed-out <stat> 8/15/91 12:18:07 allspice (14) RPC timed-out get attr of "/sprite/lib/cron/crontab" waiting for recovery <remove> 8/15/91 12:18:09 allspice (14) RPC timed-out remove of "/tmp/Ex35621" waiting for recovery <write> 8/15/91 12:18:22 allspice (14) RPC timed-out 8/15/91 12:18:22 allspice (14) RmtFile "logfile.analnew" <7,2119> Write-back failed: rpc timeout <write> 8/15/91 12:18:28 allspice (14) RPC timed-out 8/15/91 12:18:28 allspice (14) RmtFile "/sprite/syslogs/mustard.Berkeley.EDU/syslog.out" <10,2156> Write-back failed: rpc timeout RpcDoCall: <write> RPC to roar is hung RpcDoCall: <io control> RPC to roar is hung 8/15/91 12:22:04 allspice (14) rebooted <open> 8/15/91 12:24:12 allspice (14) RPC timed-out open of "/hosts/mustard.Berkeley.EDU/netTCP" waiting for recovery <close> 8/15/91 12:27:05 allspice (14) RPC timed-out <close> 8/15/91 12:27:12 allspice (14) RPC timed-out <close> 8/15/91 12:27:18 allspice (14) RPC timed-out 8/15/91 12:30:03 allspice (14) rebooted <io control> RPC exit 0x70003 <write> RPC exit 0x40007 <30>Aug 15 12:30:20 migd[12c18]: Write to global daemon timed out. 8/15/91 12:30:21 allspice (14) RmtFile "/sprite/admin/migd/mustard.Berkeley.EDU.log" <10,9568> : stale handle 8/15/91 12:30:21 allspice (14) - recovering handles 8/15/91 12:30:21 allspice (14) RmtFile "/sprite/admin/migd/mustard.Berkeley.EDU.log" <10,9568> : stale handle 8/15/91 12:30:32 allspice (14) Client backing off again from negative ack. 
8/15/91 12:30:49 allspice (14) Recovery complete 316 handles reopened 1291 failed reopens Fsprefix_OpenCheck waiting for recovery Fsprefix_OpenCheck ok 8/15/91 12:34:55 sedition (68) rebooted 8/15/91 12:58:03 raid1 (77) rebooted Broadcasting for server of "/user1" Importing "/user1" from allspice 8/15/91 13:15:23 catnip (48) rebooted RpcDoCall: <remove> RPC to allspice is hung <remove> RPC ok 8/15/91 13:39:50 treason (53) rebooted 8/15/91 13:45:24 treason (53) rebooted LE ethernet: Received packet with CRC error. 8/15/91 13:57:16 joyride (74) rebooted PdevWrite: signalFrom daemon Thu Aug 15 15:09:45 1991 Received: by sprite.Berkeley.EDU (5.59/1.29) id AA69213; Thu, 15 Aug 91 15:07:20 PDT Date: Thu, 15 Aug 91 15:07:20 PDT >From: root (The Sprite God) Message-Id: <9108152207.AA69213@sprite.Berkeley.EDU> To: root Subject: Files in lost+found You have files in the following lost+found directories. These files were recovered during reboot. Please examine the following directories and recover or delete your files. //lost+found ------- End of forwarded message ------- Any idea how my syslog could end up in a mail file? I do use emacs to read my mail, but I can't see how I could get the syslog file included. The emacs code for mail reading doesn't even know the syslog exists. ethan Log-Number: 31334 Date: Fri, 16 Aug 91 16:47:08 PDT From: elm (ethan miller) Subject: problem with bibtex Some of the constants for bibtex are set too low. In particular, they only allow 1000 characters per entry. If the bibliography entry includes an abstract, it will often exceed this limit. Also, there is a limit of 65000 characters in the bibliographies. Any chance we can get a "bigger" version of bibtex? If no one objects, I'll work on this myself. ethan Log-Number: 31336 Date: Sat, 17 Aug 91 10:12:08 PDT From: mendel@ginger.Berkeley.EDU (Mendel Rosenblum) Subject: allspice crash - out of memory Allspice was down this morning because the kernel ran out of memory. 
I took a core file "allspice.817". I believe that the kgcore on ginger contains the fix that allows cores of out of memory errors to be debugged. One thing I did notice was the compat kernel only allowed the kernel to grow to 32 megabytes. Previous kernels such as the current "new" kernel allow up to 40 megabytes. Does anyone know why this change was removed? Mendel Log-Number: 31337 Date: Sun, 18 Aug 91 12:39:57 PDT From: pmchen (Peter M. Chen) Subject: discrepancy between df and du On /user4, I noticed that we suddenly went from about 50% used to 88% used. In looking at this, I took a du /user4/* (results in ~pmchen/tmp/du8.18). I totalled the sizes at the top level (/user4/*), found in ~pmchen/tmp/du8.18.sort and found that only 359 MB were used. Df reported that 488 MB were used. I forgot to specify the -a option to du, but even so, df is off by over 100 MB. Pete Log-Number: 31345 Date: Wed, 21 Aug 91 13:10:15 PDT From: margo (Margo Seltzer) Subject: File system filling prematurely /postlfs reports 111% utilization with 69631 blocks used out of 69632 available. However, a du on /postlfs returns: 1 ./lost+found 18533 ./margo/user 882 ./margo/kernel 19416 ./margo 47 ./marks/data/files 13 ./marks/data/base/.postDb.6 291 ./marks/data/base/.postDb.2 1 ./marks/data/base/.postDb.1 1839 ./marks/data/base/.postDb.1517 2149 ./marks/data/base 2197 ./marks/data 3458 ./marks 22876 . - Margo Log-Number: 31350 Subject: L1-N should provide feedback Date: Thu, 22 Aug 91 17:19:37 PDT From: Mike Kupfer <kupfer> Allspice was having a long series of "Reinit recv unit". I tried Break-N a bunch of times, but there was little sign that anything was happening, except for a few "receiver overrun" messages that appeared some seconds after I hit Break-N. Given the unreliability of the Break-foo mechanism, I think that whatever code is responsible for resetting the network should also do some sort of printf to verify that it has been invoked. 
If the reset takes more than one second, it should also tell when it is finished. mike Log-Number: 31351 From: mendel (Mendel Rosenblum) Subject: Re: L1-N should provide feedback Date: Thu, 22 Aug 91 17:30:42 PDT > > Allspice was having a long series of "Reinit recv unit". I tried > Break-N a bunch of times, but there was little sign that anything was > happening, except for a few "receiver overrun" messages that appeared > some seconds after I hit Break-N. Given the unreliability of the > Break-foo mechanism, I think that whatever code is responsible for > resetting the network should also do some sort of printf to verify > that it has been invoked. If the reset takes more than one second, it > should also tell when it is finished. > > mike The "reinit recv unit" just means that allspice got overrun with packets. Resetting the network interface on allspice is a bad idea. There is a bug in the network that hangs the machine for several seconds during a net reset. This causes the clients to time out and go through recovery with allspice, further overloading the machine. Mendel Log-Number: 31352 Subject: allspice crash: / filled up Date: Thu, 22 Aug 91 17:59:28 PDT From: Mike Kupfer <kupfer> The root partition on allspice filled up, so it crashed with a hashing error in the fscache code. We rebooted without taking a core dump. mike Log-Number: 31353 Subject: fluff in root partition Date: Thu, 22 Aug 91 19:17:10 PDT From: Mike Kupfer <kupfer> I was wondering where all the space was going in the root partition. While cleaning up (mostly deleting mongo log files in /sprite/admin), I noticed some odds and ends that I thought we could get rid of. They won't win back much space, but it would reduce the clutter. 
These just seem old and unused: /dev.old (355KB) /sprite/admin/mig.stats (349KB) /sprite/boot/sun4.md/{alc,atc} (800KB each) /sprite/boot/sun4c.md/brian (700KB) /sprite/cmds.ds3100/{kgdb,kmsg,sh}.old /sprite/cmds.sun3/{gdb,kgdb,kmsg}.old /sprite/cmds.sun4/{kgdb,kmsg}.old /sprite/doc/ref.ancient These seem like they were put there because of some filesystem problem, but nobody ever removed them: /tmp/bad /tmp.bad /sprite/BADFILES /sprite/trashed/* So can I just nuke all this stuff, or...? mike Log-Number: 31356 From: mendel (Mendel Rosenblum) Subject: sigcontext missing for decstation Date: Fri, 23 Aug 91 12:54:10 PDT The definition of struct sigcontext was moved from signal.h into "machSignal.h" for each machine type. The problem is there is no machSignal.h for the decstation. Mendel Log-Number: 31359 Subject: load on allspice Date: Fri, 23 Aug 91 16:58:35 PDT From: Mike Kupfer <kupfer> Allspice has gotten a serious case of the slows both this afternoon and yesterday afternoon. The problems eventually went away by themselves, but it sure was painful trying to get things done for awhile. I wonder if having /tmp, /swap1, and /sprite/src/kernel all on allspice is partly responsible for the performance problems. mike Log-Number: 31360 Date: Fri, 23 Aug 91 17:01:39 PDT From: mottsmth (Jim Mott-Smith) Subject: Re: load on allspice It seems to me the fundamental problem is that we've doubled (roughly) all the clients' horsepower (ds3100->ds5000, sparc1->sparc2) without any corresponding increase for allspice. -- Jim M-S Log-Number: 31361 Subject: allspice crash: DMA bus error -> LFS short read panic Date: Sat, 24 Aug 91 21:42:13 PDT From: Mike Kupfer <kupfer> Allspice died this evening while doing dumps. Here's what was on the console: Warning: SCSI3#3 DMA bus error Fatal Error: LfsError: on /sprite/src/kernel, status 0x1 LfsReadBytes short read It was running the 1.098 kernel. We rebooted with the 1.099 kernel. 
mike Log-Number: 31362 Subject: allspice crash(es): SCSI problems? Date: Sun, 25 Aug 91 18:31:35 PDT From: Mike Kupfer <kupfer> When I came in this evening, allspice was down. The latest couple messages on the console were Warning: SCSI3#1 unable to select target SCSI3#1 Target 4 LUN 0 8/25/91 11:32:33 broadcast (0) File "(NULL)" <10,0> Write-back failed: cacheable/busy conflict Fatal Error: LfsError: on /scratch4 status 0x1, LfsWriteBytes short write The kernel was 1.099, and the core file is /home/ginger/cores/allspice.scsiSelect. I rebooted allspice from ginger. Around the time that allspice figures out where its root is, there was a very long pause. I hit L1-r and L1-v a couple times but got no response. Eventually I saw Warning: SCSI3#1 DMA register conflict goof Warning: SCSI3#1 Target 0 LUN 0 reset and current command terminated. There then followed a bunch of Ofs complaints about being unable to write back various bits of information, and allspice went back into the debugger. I reset allspice (using the reset switch in the back) and booted the "new" kernel again. Should I have saved a core file? mike Log-Number: 31363 Subject: more SCSI problems on allspice Date: Sun, 25 Aug 91 19:06:35 PDT From: Mike Kupfer <kupfer> Allspice died again, this time while dumping /scratch4. Warning: SCSFatal Error: LfsError: on /scratch4 status 0x70000, LfsReadBytes failed This is with the 1.099 kernel. There's a core file in /home/ginger/cores/allspice.status70000. I tried to reboot "sprite", only to see the system hang. After booting "old" (1.096), I discovered that the 1.098 kernel was installed as "compat" and "vmsprite". Am I confused, or is the correct name "sprite" and not "vmsprite"? Also, how hard would it be to change the disk boot program to say "``sprite'' not found" instead of just hanging? mike P.S. /home/ginger/cores is getting full again. 
Log-Number: 31365 From: mendel (Mendel Rosenblum) Subject: Re: more SCSI problems on allspice Date: Sun, 25 Aug 91 19:41:13 PDT > I tried to reboot "sprite", only to see the system hang. After > booting "old" (1.096), I discovered that the 1.098 kernel was > installed as "compat" and "vmsprite". Am I confused, or is the > correct name "sprite" and not "vmsprite"? Also, how hard would it be > to change the disk boot program to say "``sprite'' not found" instead > of just hanging? Well, when I was doing the kernel swap there was no "sprite" in /allspiceA. The current "sprite" was named "vmsprite" there. I just made the old "compat" kernel "vmsprite". I suspect this was done because there is code in the disk boot program to default to "vmsprite" if no filename is specified. Anyway, it probably doesn't work. Mendel Log-Number: 31367 Date: Mon, 26 Aug 91 01:01:14 -0700 From: dlong@cse.ucsc.edu Subject: Re: more SCSI problems on allspice > Allspice died again, this time while dumping /scratch4. > > Warning: SCSFatal Error: LfsError: on /scratch4 status 0x70000, > LfsReadBytes failed > > This is with the 1.099 kernel. There's a core file in > /home/ginger/cores/allspice.status70000. > > I tried to reboot "sprite", only to see the system hang. After > booting "old" (1.096), I discovered that the 1.098 kernel was > installed as "compat" and "vmsprite". Am I confused, or is the > correct name "sprite" and not "vmsprite"? Also, how hard would it be > to change the disk boot program to say "``sprite'' not found" instead > of just hanging? Hmmm, I thought I already reported that, but maybe I just sent a note to Mendel and Mary. Anyway, there is a bug in the fs code that I had to fix to get diskboot to work for the sun4c, and I guess the fix never made it to the sun4 diskboot. Just do a diff between sunprom/fs.c and diskBoot.OpenProm/fs.c and you'll see the bug fix. dl > > mike > > P.S. /home/ginger/cores is getting full again. 
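[Editorial sketch, not taken from the Sprite sources.] The exchange above asks the disk boot program to report a missing kernel instead of hanging, and notes that it falls back to "vmsprite" when no filename is given. The fragment below is one rough way to picture that behavior in C. Everything in it is hypothetical: DEFAULT_KERNEL, bootDir, KernelExists, and ChooseKernel are made-up names, and the lookup is a stand-in array rather than the real boot file system code.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Assumed default name; the thread says the boot program falls back to "vmsprite". */
#define DEFAULT_KERNEL "vmsprite"

/* Hypothetical stand-in for the boot directory contents (not the real /allspiceA). */
static const char *bootDir[] = { "vmsprite", "compat", "old", "new" };

/* Returns 1 if the given name is present in the stand-in directory, else 0. */
static int KernelExists(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(bootDir) / sizeof(bootDir[0]); i++) {
        if (strcmp(bootDir[i], name) == 0) {
            return 1;
        }
    }
    return 0;
}

/*
 * Picks the kernel to boot.  An empty or missing request falls back to
 * DEFAULT_KERNEL.  If the chosen name does not exist, a message is
 * printed and NULL is returned, so the caller can stop with a
 * diagnostic rather than hang.
 */
static const char *ChooseKernel(const char *requested)
{
    const char *name;

    name = (requested == NULL || requested[0] == '\0')
        ? DEFAULT_KERNEL : requested;
    if (!KernelExists(name)) {
        printf("``%s'' not found\n", name);
        return NULL;
    }
    return name;
}
```

With this stand-in directory, ChooseKernel("sprite") prints the not-found message and returns NULL, while ChooseKernel(NULL) falls back to "vmsprite".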
Log-Number: 31370 Date: Mon, 26 Aug 91 21:50:47 PDT From: dlong@dogwood.ucsc.edu (Dean Long) Subject: bug in as for sparc You might think "set 4,%o1" and "mov 4,%o1" would do the same thing, but they don't. The "set" pseudo-instruction *always* turns into two instructions: a "sethi" followed by an "or". The "mov" pseudo-instruction turns into a single "or" instruction, ignoring all but the low 12 bits of the constant. dl Log-Number: 31374 Date: Tue, 27 Aug 91 16:21:49 PDT From: ouster (John Ousterhout) Subject: Mail wedged Mail doesn't seem to be getting into Allspice. Can someone unwedge it? Thanks. -John- Log-Number: 31375 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Wed, 28 Aug 1991 11:26:39 PDT Subject: allspice reboot problems due to prefix Allspice crashed this morning due to an LFS bug (I'll leave it to Ken or Jim to give us more details on that). We had problems rebooting allspice due to the newly installed "prefix" command. The new prefix stats a remote link to make sure it exists when you install a prefix. Unfortunately, it appears that stat follows the link, which doesn't exist because it hasn't been installed yet. Prefix should be modified to only stat the remote link itself, not what it points to. I think the call to stat has to be changed to a call to Fs_GetAttributes, but that's just a guess. John Log-Number: 31376 Date: Wed, 28 Aug 91 18:05:58 -0700 From: margo@postgres.berkeley.edu (Margo Seltzer) Subject: makedepend hanging I've been rebuilding kernel modules on decstations (a variety of machines), and periodically makedepend will hang seemingly infinitely (it hung for 2 hours this afternoon while running on pepper). Killing it and rerunning it seems to work. I had this problem about 2 months ago and it went away, only reappearing very recently. - M Log-Number: 31377 Date: Thu, 29 Aug 91 08:34:12 PDT From: ouster (John Ousterhout) Subject: Re: makedepend hanging I've also noticed makes hanging from time to time. 
When this happens I've found that I can control-Z and continue them, and it unwedges them. Could this be related in some way to the new compatibility kernel, with its new way of handling signals and migration? -John- Log-Number: 31380 Subject: file hole isn't zero filled Date: Thu, 29 Aug 91 12:04:04 PDT From: Mike Kupfer <kupfer> The man page for lseek says that if you seek beyond the end of a file and write to it, the intervening hole will be logically zero-filled. (That is, if you read from the hole, you'll get all zeroes.) In Sprite this only partly works. If you read the hole while the file is still open, you get zeros. If you close the file, reopen it, and read the hole, you get garbage. mike Log-Number: 31388 From: mendel (Mendel Rosenblum) Subject: Re: file hole isn't zero filled Date: Fri, 30 Aug 91 11:33:57 PDT > In Sprite this only partly works. If you read the hole while the file > is still open, you get zeros. If you close the file, reopen it, and > read the hole, you get garbage. > mike There are several bugs causing this. Most of the problems stem from delayed writes, attribute caching, and the Sprite implementation of attributes. On the Sprite file server attributes such as the lastByte (i.e. size) of a file are kept in two locations and updated by two different calls. The "lastByte" value stored in the cacheInfo structure is updated when a client closes a file. During the close RPC the attributes that are cached on the client such as the access time, modify time, and lastByte are updated in the server cache. The "lastByte" is also kept in the Fsdm_FileDescriptor on disk. The in-memory copy of this structure is updated when writes are done to the file. That is, when write RPCs make it over to the file server, the lastByte in the Descriptor gets updated so it includes this block. 
What happens in the common case is when the file is closed the lastByte in the cacheInfo gets set to the correct value while the one in the Fsdm_FileDescriptor is still -1 (size == 0). The Fsdm_FileDescriptor gets updated when the delayed writes happen. The garbage gets generated due to a bug in the BlockRead procedure for both OFS and LFS (Not surprising since LFS is just a copy of the OFS code for this condition.) The bug happens when a read occurs at an offset past the lastByte (in the Fsdm_FileDescriptor) of the file. Both storage managers just return SUCCESS and FS_BLOCK_SIZE bytes transferred and don't zero the cache block. I put a patch in both OFS and LFS to zero the cache blocks so the applications will get zeros rather than garbage. There is also a bug in the fscache module that returns the wrong thing when this happens. I fixed that too. Mendel Log-Number: 31386 Date: Fri, 30 Aug 91 08:51:37 PDT From: ouster (John Ousterhout) Subject: Lust crash Lust was in the debugger when I came in this morning: LfsError: on /pcs/vlsi status 0x1, LfsReadBytes short read I rebooted it. -John- Log-Number: 31387 From: mendel (Mendel Rosenblum) Subject: Re: Lust crash Date: Fri, 30 Aug 91 11:32:41 PDT > Subject: Lust crash > > Lust was in the debugger when I came in this morning: > > LfsError: on /pcs/vlsi status 0x1, LfsReadBytes short read > > I rebooted it. > -John- Was there any other message before this one? This means that LFS did a read that returned SUCCESS but did not read as many bytes as requested. If there is a message before this that looks something like: DevRawBlockDevRead: error 0x0 inLength NN at offset NN outLength XXX then it probably was an LFS error that tried to read outside the block range of the disk partition. I checked the file system /pcs/vlsi and there are no bogus block pointers that would cause this type of error. The other possibility is that there is a problem in the new SCSI or dev module code. Too bad we can't kgcore decstations. 
Mendel Log-Number: 31393 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Tue, 3 Sep 1991 12:34:00 PDT Subject: crash due to exec from pfs My machine crashed when it tried to reclaim a segment that was used previously as a text segment for something I ran from the sww. Evidently the sticky segment stuff doesn't work correctly with pseudo-filesystems. The sticky segment ends up with a dangling reference to a file handle that has been reused. When the kernel tries to reclaim the segment it barfs on the contents of the file handle. John Log-Number: 31396 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Wed, 4 Sep 1991 17:48:31 PDT Subject: installall of scripts broken If one does a "pmake installall" for a script it should be installed for all machine types. Currently it will install for some types, but not others. On a decstation it will install for the decstation and sun3, but not sun4. Here is a transcript of my latest attempt. John loiter<jhh 117> pmake installall --- installsun3 --- pmake -l 'INSTALLDIR=/local/cmds' -k TM=sun3 install --- install --- Updating: /local/cmds.sun3/scvs --- installds3100 --- pmake -l 'INSTALLDIR=/local/cmds' -k TM=ds3100 install --- install --- Updating: /local/cmds.ds3100/scvs loiter<jhh 118> pmake install TM=sun4 --- .BEGIN --- you cannot compile for a sun4 on this machine exit 1 *** Error code 1 --- install --- Updating: /local/cmds.sun4/scvs Log-Number: 31397 Subject: Re: installall of scripts broken Date: Wed, 04 Sep 91 18:04:31 PDT From: Mike Kupfer <kupfer> The sun4 ld that runs on DECstations is broken (unless somebody has fixed it since I looked at it in early May). I thought it would be simpler to disable all sun4 stuff for DECstations, rather than just disabling ld. I'm willing to be outvoted, though. Of course, the real solution is to fix ld. Sad to say, it looked pretty swamp-like to me, with bits and pieces of various source trees glued together in some arcane fashion. 
Maybe the thing to do is start from scratch, using the most recent ld release from the FSF. mike Log-Number: 31401 Date: Fri, 6 Sep 91 08:17:08 PDT From: bmiller (Bob Miller) Subject: allspice down Allspice was down when I came in this morning... Fatal Error: LfsError: on /scratch1 status 0x1, LfsReadBytes short read Entering debugger with a Interrupt Trap (16) exception at PC 0xf60ca9dc Log-Number: 31403 From: mendel (Mendel Rosenblum) Subject: Re: signal.h: cannot find machSignal.h Date: Fri, 06 Sep 91 09:50:22 PDT > > Subject: signal.h: cannot find machSignal.h > > > When I compile a file which includes signal.h, I get complaints about not > being able to find machSignal.h > > - M The problem appears to be that signal.h now includes machSignal.h and machSignal.h is in /sprite/lib/include/$(TM).md/sys which is not in the normal include path for the C compiler. Until this is fixed you might be able to compile things by adding a -I/sprite/lib/include/machine/sys to your CFLAGS. Mendel Log-Number: 31405 From: mendel (Mendel Rosenblum) Subject: Re: signal.h: cannot find machSignal.h Date: Fri, 06 Sep 91 10:18:32 PDT > > I thought that the solution we agreed upon for this problem was > to put in symbolic links for machine-dependent include files. > Can someone do this for machSignal.h? > > -John- I couldn't remember what we agreed on. I created a symbolic link named machSignal.h in /sprite/lib/include that points to machine/sys/machSignal.h. Someone speak up if this is not the right thing. Mendel Log-Number: 31407 Subject: bogus process locking in Proc_NewProc? Date: Fri, 06 Sep 91 18:55:23 PDT From: Mike Kupfer <kupfer> ProcGetUnusedPCB returns a locked PCB entry. Proc_NewProc then promptly unlocks it with procPtr->genFlags = procType; (rather than ORing in the process type). Am I missing something here, or is this a bug? 
mike Log-Number: 31408 Subject: raid1 crash: level 15 Memory Interrupt Date: Fri, 06 Sep 91 19:31:32 PDT From: Mike Kupfer <kupfer> raid1 died about an hour ago with Memory Interrupt (level 15) (31) exception at PC 0xf60d910c Mendel started looking at it but then it freaked out, so I rebooted it. mike Log-Number: 31410 From: mendel (Mendel Rosenblum) Subject: Bug in scsi disk error reporting Date: Sat, 07 Sep 91 14:36:44 PDT With all the shuffling in the device module the scsi disk code lost the ability to detect and report errors. The problem is that when DiskDoneProc in devSCSIDisk.c is called and there is an error the SCSI sense data is not passed correctly to the DiskError() routine. Instead the DiskError() routine uses the sense buffer that was added to the ScsiDisk structure, which doesn't contain the correct data. The disk /dev/rsd01c on allspice (/scratch1 file system) has a media error at sector 233107 that had been crashing allspice but it was not ever reported in the syslog. Mendel Log-Number: 31413 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Sat, 7 Sep 1991 16:58:16 PDT Subject: Re: Bug in scsi disk error reporting I'll take responsibility for this one. When I was working on the exb8500 driver I cleaned the scsi stuff up somewhat. I didn't feel like totally rewriting the dev module, so the HBA code uses the old way of handling the scsi stuff. Unfortunately I botched the interaction between the two a bit. The real solution is to rewrite the HBA stuff, but I don't want to do that right now because the kernel is in a state of flux. Instead I've added a bcopy that should fix the problem. John Log-Number: 31414 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Sat, 7 Sep 1991 17:40:26 PDT Subject: dumps did not complete I've noticed that the dumps did not complete on Friday morning or Saturday morning. Further investigation revealed that a "skip to end of data" command sent to the device is not completing. 
I'm unable to get into the machine room to look at the tape drive because my card key is non-functioning. I think I'll set up a drive in 608-2 so Ken can do the full dumps. John Log-Number: 31415 Date: Sun, 8 Sep 91 14:10:53 PDT From: shirriff (Ken Shirriff) Subject: Allspice lfs crash Allspice crashed with LfsSetSegUsage called on clean segment (1083) The core is in vmcore.lfs Log-Number: 31419 Date: Tue, 10 Sep 91 00:38:15 PDT From: shirriff (Ken Shirriff) Subject: /hosts/raid1/dev is poison The directory /hosts/raid1/dev is poison: an ls in there will wedge up, and the dumps die in there. Thus, I can't dump /. Since I've wedged up the exabyte on exabyte and murder, I don't know if I can dump anything else either. Log-Number: 31420 From: mendel (Mendel Rosenblum) Subject: /boot problem - No permission checking on ftruncate Date: Tue, 10 Sep 91 11:00:13 PDT The ftruncate call will let a user truncate any open object regardless of access modes or object type. It is possible to ftruncate a file opened in read-only mode. Even worse, it is possible to ftruncate an open directory. This messes up the file system. I suspect that this is what murder did to /boot. When allspice reboots the contents of "/boot" will reappear in lost+found. Mendel Log-Number: 31421 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Tue, 10 Sep 1991 11:41:02 PDT Subject: new dump.new I've fixed one bug with dump.new. If you specified the "-r" or "-s" option, and the tape was in the old format the new label would be written in the old format as well. Unfortunately the old format does not work on the new drives, so dump would die after dumping one file system. I've installed a new dump.new that fixes this problem. I still don't know about the reported problems with the tape drives hanging. I've been unable to repeat the problem. John Log-Number: 31423 Date: Tue, 10 Sep 91 17:50:42 PDT From: pmchen (Peter M. Chen) Subject: consistency problems? 
I've been getting some weird results with edit-compile cycles. I edited a file (changing a #define constant), issued pmake, then re-ran. The compile of the right file actually took place. But, when I re-ran, it ran like the old binary. When I re-recompiled, it worked fine. This whole cycle was repeated twice. Pete ps. this was on mustard, a ds5000. The offending directory was in ~/bench/specWl. I changed the constant CAPACITYEXPANSION in specWl.h from 2 to 4. Log-Number: 31427 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Wed, 11 Sep 1991 23:22:02 PDT Subject: file system not attached during boot When allspice was rebooted the /sprite/src/kernel filesystem was not attached. The "prefix" command is in its bootcmds, yet it didn't do anything. All other file systems appear to have attached properly. I issued the prefix command by hand and it attached without any errors. I didn't find a prefix process in the debug state or anything, so I'm not sure what happened to it. John Log-Number: 31429 Date: Thu, 12 Sep 91 17:46:00 PDT From: shirriff (Ken Shirriff) Subject: cc bug Compiling the following program puts cc (actually ccom) into the debugger on a ds3100. (Admittedly, the program is in error, but going into the debugger seems harsh.) main() double f(c,x) double c,x; { return c*x*(1-x); } { double c; double v; for (c=0;c<3;c+=.01) { v = f(f(f(c,.5),.5),.5); if (v>.48 && v<.52) { printf("%f,%f\n", f(f(f(c,.5),.5),.5)); } } } Log-Number: 31431 Date: Thu, 12 Sep 91 23:56:15 PDT From: shirriff@ginger.Berkeley.EDU (Ken Shirriff) Subject: Allspice crash: LfsWriteBytes Allspice crashed with: SCSI3#0 can't select SCSI3#0 Target 5 LUN 0 LfsError on /pcs Status 0x1, LfsWriteBytes short write The core is in vmcore.short Log-Number: 31432 Date: Fri, 13 Sep 91 08:52:30 PDT From: bmiller (Bob Miller) Subject: allspice crash allspice was down when I came in this morning... 
Fri Sep 13 02:00:00 Warning: SCSI3#0 cant select SCSI3#0 Target 5 LUN 0 " " " Fatal Error: Lfs Error: on /tmp.old status 0x1, LfsWriteBytes short write Entering debugger with a Interrupt Trap (16) exception at PC 0xf60ca9dc core dump is in vmcore.allspice.crash.9-13 Log-Number: 31433 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Fri, 13 Sep 1991 09:07:20 PDT Subject: wrong device resets bus Both of allspice's recent crashes happened when it tried to access a non-existent device on hba 0. I created a tape device for larceny, but screwed up so the server was localhost. When allspice tried to do the dumps it would crash. We should probably get rid of the frequent resets in the scsi driver. John Log-Number: 31437 From: mendel (Mendel Rosenblum) Subject: Re: Allspice reboot Date: Fri, 13 Sep 91 12:28:00 PDT > > Oops, John's message reminded me that I forgot to report the Allspice > crash this morning. Here are the last few lines from the console: > > Entering debugger ... > > I took a core dump and left it in ginger:/home/ginger/cores/vmcore.13Sep91. > > -John- The crash was a Level 15 interrupt caused by a cache writeback error. The bad address was 0xfff141f0 in context 0. This is a mapping area for VDMA used by the scsi HBAs. I have no idea what would cause this error to happen. At the time dump.new was dumping the file "bsdtraces" from /user6. Mendel Log-Number: 31442 Date: Fri, 13 Sep 91 15:40:52 PDT From: ouster (John Ousterhout) Subject: Sendmail problem? There's a sendmail process on tyranny that seems to be looping infinitely; every 5 seconds it prints out a message on the syslog like the following: <18>Sep 13 15:40:16 sendmail[34b10]: NOQUEUE: SYSERR: getrequests: accept: operation not supported on socket Does anyone know what this means or what can be done to unwedge the sendmail process? 
-John- Log-Number: 31447 Date: Sat, 14 Sep 91 17:23:52 PDT From: shirriff (Ken Shirriff) Subject: gprof broken If I compile a program with -pg on the sun4 to use the profiler, the program dies on execution with a segmentation violation in monstartup. Ken Log-Number: 31448 From: mendel (Mendel Rosenblum) Subject: Re: gprof broken Date: Sat, 14 Sep 91 17:46:43 PDT > > If I compile a program with -pg on the sun4 to use the profiler, the program > dies on execution with a segmentation violation in monstartup. > > Ken The bug is in the Unix compatibility stuff. When linking with the -pg switch a special startup code get executed (entry gstart). This doesn't match the check in proc/sun4c2.md/procMach.c so it thinks it is a Unix binary and sets up the heap segment incorrectly. The bug is in a "hack" in procMach.c Mendel ps. The code in procMach.c is an insult to the definition of a "hack". If someone changes something in the first couple of instruction of the startup code for a user program it will quit working. From sun4c.md/machMach.c /* * The following few lines are total hack. The idea is to look at * the startup code to see if it was a Sprite-compiled file, or * a Unix-compiled file. */ sizeRead = 4*sizeof(int); status = Fs_Read(filePtr, (char *)data, execPtr->entry-PROC_CODE_LOAD_ADDR(*execPtr), &sizeRead); if (status != SUCCESS) { printf("READ failed\n"); return(PROC_BAD_AOUT_FORMAT); } #ifdef sun3 if (data[0]==0x241747ef && data[1]==0x42002 && (data[2]==0x52807204 || data[2]==0x5280223c) && ((data[3]&0xffff0000)==0x4eb90000 || data[3]==4)) { #else /* Normal sun4 startup code */ if ((data[0]==0xac10000e && data[1]==0xac05a060 && data[2]==0xd0058000 && data[3]==0x9205a004) || /* Profiled sun4 startup code */ (data[0]==0xbc100000 && data[1]==0x11000008 && data[2]==0x13000208 && data[3]==0x400038df)) { #endif type = TYPE_SPRITE; } else { type = TYPE_UNIX; #ifdef sun3 /* * Special check for emacs, which has weird startup code. 
         */
        if (data[0]==0x4e560000 && data[1]==0x61064e5e &&
                data[2]==0x4e750000) {
            type = TYPE_SPRITE;
        }
    #endif

Log-Number: 31456
From: mendel (Mendel Rosenblum)
Subject: Re: rdate fails on clean kernel
Date: Mon, 16 Sep 91 14:16:08 PDT

> Subject: Re: rdate fails on clean kernel
> Cc: bugs@sprite.Berkeley.EDU, jhh@sprite.Berkeley.EDU
>
> I think leaving it broken is a bad idea. Not all programs depend on
> the kernel being broke. Some expect the kernel to behave properly.
>
> dl

I agree that leaving it broken is a bad idea, but fixing it requires
recompiling everything that uses the library routine "connect()". By just
changing the kernel and not recompiling everything we break lots of stuff
and spend the next year finding programs that fail for unknown reasons and
start working again when recompiled. Since we are changing over to implement
connect() as a real system call, which will require recompiling all the code
that uses connect() anyway, I think we should leave it broken for now.

If we really want to fix it now we should add a new correctly working system
call for select and leave the old broken one for backward compatibility.

Mendel

Log-Number: 31458
Subject: can't build kgdb.sun4 for sun3
Date: Mon, 16 Sep 91 17:49:50 PDT
From: Mike Kupfer <kupfer>

If you do a "pmake sun3" in /sprite/src/cmds/kgdb.sun4, you get the
following error (at least if you do it on a sun4).
--- sun3.md/dep.o --- In file included from dep.c:42: /sprite/src/lib/include/sun4.md/sys/core.h:51: field `c_fparegs' has incomplete type The problem complaint is from the definition of "struct core": struct core { int c_magic; /* Corefile magic number */ int c_len; /* Sizeof (struct core) */ struct regs c_regs; /* General purpose registers */ struct exec c_aouthdr; /* A.out header */ int c_signo; /* Killing signal, if any */ int c_tsize; /* Text size (bytes) */ int c_dsize; /* Data size (bytes) */ int c_ssize; /* Stack size (bytes) */ char c_cmdname[CORE_NAMELEN + 1]; /* Command name */ #ifdef sun3 /* This is from the old core.h (C) 1985, but gdb still wants this stuff, and I don't think any other sprite program uses core.h */ struct fp_status c_fpstatus; /* External FPP state, if any */ struct fpa_regs c_fparegs; /* FPA registers, if any */ int c_pad[CORE_PADLEN]; /* see comment above */ #else #ifdef FPU struct fpu c_fpu; /* external FPU state */ #endif int c_ucode; /* Exception no. from u_code */ #endif }; The definition of "struct fp_status" comes from <sun4.md/reg.h>, but that file doesn't have "struct fpa_regs". mike Log-Number: 31459 Subject: ds3100 kmsg doesn't compile Date: Mon, 16 Sep 91 18:11:15 PDT From: Mike Kupfer <kupfer> --- ds3100.md/kmsg.o --- rm -f ds3100.md/kmsg.o cc -g3 -O -Dds3100 -Dsprite -Uultrix -I. -Ids3100.md -I. -I/sprite/lib/include -I/sprite/lib/include/ds3100.md -c ds3100.md/kmsg.c -o ds3100.md/kmsg.o ccom: Error: ds3100.md/kmsg.c, line 46: syntax error static Dbg_Request *requestPtr = (Dbg_Request *) requestBuffer; --------------------------^ Plus a zillion other errors propagated from this one. mike Log-Number: 31460 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Mon, 16 Sep 1991 22:50:29 PDT Subject: exec from pfs fails I'm not sure of the status of doing an exec() from a pfs so perhaps this bug has been fixed already. I'm running 1.099 on a sun4c. 
Cd into /home/ginger/sprite/users/jhh and do the following:

    while(1)
    ./date
    end

The kernel will eventually die with the following:

    Fatal Error: Fs_RetSegPtr, bad stream type <big number>

Here is the stack:

    #1  0xf602e0d4 in Fs_GetSegPtr (...) (...)
    #2  0xf60cc2e0 in CleanSegment (...) (...)
    #3  0xf60cc174 in DeleteSeg (...) (...)
    #4  0xf60cba14 in Vm_SegmentNew (...) (...)
    #5  0xf60daae0 in Vm_MmapInt (...) (...)
    #6  0xf60ce7b0 in Vm_MmapStub (...) (...)
    #7  0xf601256c in MachUnixSyscallTrap ()

John

Log-Number: 31461
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Mon, 16 Sep 1991 23:07:10 PDT
Subject: waitpid() crashes kernel

Sprite doesn't have waitpid() in its C library (it probably should but
that's another story). On a whim I copied an executable off ultrix and
sunos that uses waitpid to see what happens. I ran my test on the 1.099
program. Here is the program:

    #include <stdio.h>
    #include <sys/wait.h>

    main(argc, argv)
        int argc;
        char **argv;
    {
        int pid;
        int status;
        int ret;

        pid = fork();
        if (pid == 0) {
            sleep(1);
            printf("Child exiting\n");
            exit(1);
        } else {
            ret = waitpid(pid, &status, 0);
            printf("ret = 0x%x, status = 0x%x\n", ret, status);
        }
    }

On a Decstation waitpid() returns -1, indicating that an error occurred.
This seems reasonable. On a Sun it's as if the waitpid never happened.
The subsequent printf never produces any output, although the child's
printf works fine. If you run this program over and over the kernel
will eventually die with:

    Fatal Error: invalid segNum

Here's the backtrace:

    #0  panic (__builtin_va_alist=-166911630) (sysPrintf.c line 220)
    #1  0xf60d47c0 in VmMach_AllocCheck (...) (...)
    #2  0xf60caafc in VmPageFlush (...) (...)
    #3  0xf60d07f0 in Vm_DeleteSharedSegment (...) (...)
    #4  0xf6092434 in ProcExitProcess (...) (...)
    #5  0xf6091ce4 in Proc_ExitInt (...) (...)
    #6  0xf6091bbc in Proc_Exit (...) (...)
    #7  0xf609f47c in Proc_ExitStub (...) (...)
    #8  0xf601256c in MachUnixSyscallTrap ()
    #9  0x1e083318 in ?? ()

I put a core file in /sprite/src/kernel/sprite/core.waitpid.sun4c.

John

Log-Number: 31462
From: mendel (Mendel Rosenblum)
Subject: exec from pfs fails and waitpid() crashes kernel
Date: Tue, 17 Sep 91 09:45:12 PDT

>I'm not sure of the status of doing an exec() from a pfs so perhaps this
>bug has been fixed already. I'm running 1.099 on a sun4c. Cd into
>/home/ginger/sprite/users/jhh and do the following:
>
>while(1)
>./date
>end
>
>The kernel will eventually die with the following:
>Fatal Error: Fs_RetSegPtr, bad stream type <big number>

This is the bug in which sticky segments point to pdev handles that have
been freed. JMS added some code to fix this problem. You should try
repeating the test on the "clean" kernel.

>On a Decstation waitpid() returns -1, indicating that an error occurred.
>This seems reasonable. On a Sun it's as if the waitpid never happened.
>The subsequent printf never produces any output, although the child's
>printf works fine. If you run this program over and over the kernel
>will eventually die with:
>
>Fatal Error: invalid segNum

This is a bug in the shared segment implementation. The basic problem is
that the VmCore data structures and algorithms assume that a page can be in
only one segment, mapped at the same address in all processes' address
spaces. Shared segments break this assumption and cause the kernel to
panic. SunOS dynamically linked binaries make heavy use of shared segments
and trigger this bug rapidly. At the last Sprite meeting Ken said he was
still working on a fix for this.

Mendel

ps. John, just for fun you should try running a dynamically linked binary
from a PFS. We can take bets on which bug will crash the kernel first.

Log-Number: 31463
From: mendel (Mendel Rosenblum)
Subject: nfsmount in the debugger on lust
Date: Tue, 17 Sep 91 10:11:16 PDT

Lust was hanging rpcs because the nfsmount for /home/ginger/sprite was in
the debugger. I debugged it a little bit and found that it died at line 358
of nfsName.c.
The code looks a little bogus:

         streamPtr = Pfs_OpenConnection(nfsPtr->pfsToken,
                     fileIDPtr, (16 * 1024) + 128,   /* request buffer size */
                     0, NULL,                        /* no read buffer */
                     FS_READABLE | FS_WRITABLE, &nfsFileService);
         /*
          * Enable write-behind.  We'd like to let a writer overlap its writes.
          * The request buffer is large enough for 2 8K block writes.  Using
          * write-behind increases the write bandwidth from 9k/sec to 40k/sec.
          */
    358> if (Fs_IOControl(streamPtr->streamID, IOC_PDEV_WRITE_BEHIND,
                     sizeof(int), &writeBehind, 0, NULL) != 0) {
             fprintf(stderr, "IOC_PDEV_WRITE_BEHIND failed\n");
         }
         if (streamPtr != (Pdev_Stream *)NULL) {
             streamPtr->clientData = (ClientData)fileIDPtr;
         } else {
             status = EINVAL;
         }

It looks like Pfs_OpenConnection() returned NULL. I suspect that the
IOControl should be after the check for NULL and not before it.

Mendel

Log-Number: 31464
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Tue, 17 Sep 1991 10:34:35 PDT
Subject: invalid system calls don't produce error message

Sprite doesn't support some of the Ultrix system calls (anything above 171).
The old kernels used to print out a warning message when you tried to use
an unsupported system call. The new binary compatibility stuff just returns
EINVAL. I think we should continue to print out the warning since lots of
programs are less than perfect in checking for errors.

John

Log-Number: 31466
From: mendel (Mendel Rosenblum)
Subject: malloc() inside MASTER_LOCK in VmMach_PageValidate
Date: Tue, 17 Sep 91 13:19:51 PDT

------- Forwarded Message

To: mottsmth
cc: jhh
Subject: Re: No such luck...
In-reply-to: Your message of Tue, 17 Sep 91 12:56:55 -0700.
<9109171956.AA88080@sprite.Berkeley.EDU> Date: Tue, 17 Sep 91 13:17:07 PDT >From: mendel > Return-Path: mottsmth > Received: by sprite.Berkeley.EDU (5.59/1.29) > id AA88080; Tue, 17 Sep 91 12:56:55 PDT > Date: Tue, 17 Sep 91 12:56:55 PDT > From: mottsmth (Jim Mott-Smith) > Message-Id: <9109171956.AA88080@sprite.Berkeley.EDU> > To: jhh, mendel > Subject: No such luck... > > > while (1) > ./date > end > > dies on the clean kernel also. > > Trying to reproduce it up here gives me a completely > different stack trace from the one jhh reported: > > #1 0xf60ad11c in IdleLoop (...) (...) > #2 0xf60acdb8 in Sched_ContextSwitchInt (...) (...) > #3 0xf60b29f0 in SyncEventWaitInt (...) (...) > #4 0xf60b120c in Sync_SlowLock (...) (...) > #5 0xf60b0eec in Sync_GetLock (...) (...) > #6 0xf60c8bbc in Vm_RawAlloc (...) (...) > #7 0xf607a7bc in MemChunkAlloc (...) (...) > #8 0xf607ab44 in malloc (...) (...) > #9 0xf60cea94 in VmMach_PageValidate (...) (...) > #10 0xf60c14ec in VmPageValidateInt (...) (...) > #11 0xf60c2d2c in FinishPage (...) (...) > #12 0xf60c298c in Vm_PageIn (...) (...) > #13 0xf600e810 in MachPageFault (...) (...) > #14 0xf6012834 in MachHandlePageFault () > > I'll see what I can do... > > -- Jim M-S > The problem is that VmMach_PageValidate grabs a MASTER_LOCK() and then tries to do a malloc(). This is illegal because malloc() can context switch. This is a different bug from the pfs execute stuff. It is only triggered when running something in Unix compatibility. To test the pfs problem I'd suggest copying the Sprite "date" binary. Mendel ------- End of Forwarded Message Log-Number: 31467 Subject: race condition in Proc_Detach? Date: Tue, 17 Sep 91 14:59:15 PDT From: Mike Kupfer <kupfer> Proc_Detach doesn't lock the process before setting its termReason, termStatus, and termCode. (Contrast this with Proc_SuspendProcess, which does lock the process.) Am I missing something, or is there a potential race between Proc_Detach and, say, Proc_ResumeProcess? 
mike Log-Number: 31468 Subject: cruft in Proc_Lock Date: Tue, 17 Sep 91 16:01:51 PDT From: Mike Kupfer <kupfer> Why is Sync_AddPrior called twice in Proc_Lock()? mike P.S. Many of the CLEAN_LOCK ifndef's in procTable.c are unnecessary. Sync_RecordMiss, Sync_RecordHit, Sync_StoreDbgInfo, and Sync_AddPrior are all defined to be no-op macros if CLEAN_LOCK is defined. Log-Number: 31470 From: mendel (Mendel Rosenblum) Subject: sun4 loader from hell returns Date: Wed, 18 Sep 91 16:57:05 PDT Last night at 18:50 the sun4 loader from hell was installed. Any linking done for the sparc machine type between last night and this afternoon at around 16:50 should be redone. The broken binary in /sprite/cmds.sun4/ld was replaced with the one in /sprite/cmds.sun4.old/ld. Mendel Log-Number: 31472 Date: Thu, 19 Sep 91 15:11:36 PDT From: ouster (John Ousterhout) Subject: Migration and suspension Twice this afternoon I've noticed that pmakes stopped in the middle of a "ranlib" phase with the ranlib in SUSP state. In both cases I was able to ^Z the pmake, then "fg" it successfully. However, I'm wondering if this is happening because of eviction. -John- Log-Number: 31473 From: mendel (Mendel Rosenblum) Subject: VM/X bug - dev ds5000 module problem Date: Fri, 20 Sep 91 18:41:08 PDT John and I found the problem that caused the X server of the ds5000 not to work with the clean kernel. The problem was that devGraphics.c was passing the wrong size (the size of a pointer rather that what it pointed at) to VmMach_UserMap when mapping the event queue into the server's address space. Since it always maps one page the code only broke when the event queue spanned a page boundary. The clean kernel moved things around enough so this happened. Mendel Log-Number: 31475 Subject: memory smash in shell script code in DoExec Date: Sat, 21 Sep 91 00:00:10 PDT From: Mike Kupfer <kupfer> >From inspection, it looks like there is an ugly memory smash in DoExec. 
Here are a few lines from the local variables:

    int extraArgs = 0;
    char *shellArgPtr;
    char *extraArgsArray[2];

Here is some code that uses them:

    if (userArgsPtr->argPtrArray == (char **) NIL) {
        extraArgsArray[0] = fileName;
        index = 1;
    } else {
        index = 0;
    }
    for (i = index; extraArgs > 0; i++, extraArgs--) {
        if (extraArgs == 2) {
            extraArgsArray[i] = shellArgPtr;
        } else {
            extraArgsArray[i] = fileName;
        }
    }
    extraArgsArray[i] = (char *) NIL;

extraArgs has a value of 1 or 2, depending on whether shellArgPtr points to
something useful (2 if it does). (Side note: you can bet I'm rewriting this
for the Sprite server to be more straightforward.)

Now it looks like at the very least shellArgPtr is getting overwritten by
the NIL assignment at the end. Furthermore, if the given argPtrArray is ever
NIL, extraArgs gets clobbered as well. Fortunately, neither of these
variables is used again in the function.

mike

Log-Number: 31476
Subject: allspice crash: read from clean segment
Date: Sat, 21 Sep 91 11:13:19 PDT
From: Mike Kupfer <kupfer>

Allspice died just as I was going home, so I didn't hang around to send mail
about it. It died with

    LfsOkToRead read from clean segment

The kernel was 1.099, and the core file is
/home/ginger/cores/allspice.readCleanSeg.

mike

Log-Number: 31477
Date: Sat, 21 Sep 91 15:55:14 PDT
From: shirriff (Ken Shirriff)
Subject: Allspice crashes

Allspice crashed several times this afternoon with:

    LfsOkToRead read from clean segment.

I rebooted and it quickly crashed again; I continued and it quickly crashed
again; so I debugged it and determined /scratch1 is the culprit. I unmounted
/scratch1 until someone can figure out how to fix it.

Ken

Log-Number: 31482
Subject: wait3 incompatibility
Date: Sun, 22 Sep 91 00:30:54 PDT
From: Mike Kupfer <kupfer>

If one calls wait3 with the WNOHANG flag set and there aren't any children
that have died, wait3 is supposed to return 0. Instead it returns -1 and
sets errno to EWOULDBLOCK.
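
[A minimal sketch of the expected wait3 behavior described in the report
above, written for a present-day Unix system rather than for Sprite. The
function name poll_running_child and the -2 return value used for a fork
failure are made up for this sketch. It forks a child that stays alive,
polls once with WNOHANG, and hands back whatever wait3 returned; on a
system that follows the documented behavior that value should be 0. -ed.]

```c
#define _DEFAULT_SOURCE 1   /* assumption: asks glibc to declare wait3() */
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that blocks until it is killed, then poll for exited
 * children exactly once with WNOHANG.  Returns the value wait3() gave
 * for that poll.  Returns -2 (a sentinel invented for this sketch) if
 * fork() fails.
 */
int poll_running_child(void)
{
    int status = 0;
    pid_t child, polled, reaped;

    child = fork();
    if (child < 0) {
        return -2;
    }
    if (child == 0) {
        for (;;) {
            pause();            /* stay alive until the parent kills us */
        }
    }

    /* The child exists but has not exited, so this should not block. */
    polled = wait3(&status, WNOHANG, (struct rusage *) NULL);

    /* Clean up: terminate the child and collect it for real. */
    kill(child, SIGKILL);
    reaped = waitpid(child, &status, 0);
    assert(reaped == child);

    return (int) polled;
}
```

A caller that sees -1 with errno set to EWOULDBLOCK from this poll is
looking at the incompatibility reported above.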
mike Log-Number: 31484 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Sun, 22 Sep 1991 19:00:25 PDT Subject: clean ds5000 kernel has Fs_Select problems X applications from the sww cannot be executed on a ds5000 running the clean kernel. Try "/usr/sww/X11/bin/xclock". It will just sit there until you type ^C (SIGINT), then the following is printed in the syslog: Wait (socket.c): Fs_Select failed. Wait (socket.c): Fs_Select returned 0 ready 2: connect (getsockopt) It works just fine on the new kernel. John Log-Number: 31489 From: mendel (Mendel Rosenblum) Subject: Re: raid1 hangs on sync Date: Tue, 24 Sep 91 14:56:00 PDT > > When I run "sync" on raid1, it hangs indefinitely. I don't think it's > just writing out all dirty blocks. > > I'd like to reboot raid1. Any objections if I do that now (I'll send out a > broadcast message before I do it)? > > Pete This is a hardware problem involving the disk rvj41. Mendel Log-Number: 31487 From: Fred Douglis <douglis@cs.vu.nl> Subject: rlogin pdev problem Date: Tue, 24 Sep 91 13:54:01 +0200 I logged into arson and after a couple of minutes got the following: ReplyWithData couldn't send pdev reply; status "address given by the user for a system call was bad" Anyone know what it means? Fred Log-Number: 31488 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Tue, 24 Sep 1991 12:07:57 PDT Subject: NIL swapFilePtr Loiter crashed running the clean kernel when it tried to do an Fs_PageCopy and the srcStreamPtr was NIL. The swap file for the segment was NIL, and VmCopySwapPage passed it to Fs_PageCopy anyway. Here is the stack. (srcStreamPtr is actually NIL. The debugger is lying about its value). 
John #0 Fs_PageCopy (srcStreamPtr=(struct Fs_Stream *) 0x80238f44, destStreamPtr=(struct Fs_Stream *) 0xc021bc04, offset=-1071092488, numBytes=-2145155956) (fsPageOps.c line 215) 215 srcHdrPtr = srcStreamPtr->ioHandlePtr; #1 0x800f9bb4 in VmCopySwapPage (srcSegPtr=(struct Vm_Segment *) 0x8023848c, virtPage=65540, destSegPtr=(struct Vm_Segment *) 0xc81cfecc) (vmServer.c line 455) #2 0x800f0e5c in COW (virtAddrPtr=(struct Vm_VirtAddr *) 0xc81cff24, ptePtr=(unsigned int *) 0xc02869cc, isResident=0, deletePage=1) (vmCOW.c line 1015) #3 0x800effa0 in VmCOWDeleteFromSeg (segPtr=(struct Vm_Segment *) 0x8, firstPage=-2147107340, lastPage=-1071455180) (vmCOW.c line 404) #4 0x800f7c60 in DeleteSeg (segPtr=(struct Vm_Segment *) 0x0) (vmSeg.c line 824) #5 0x800f7be8 in Vm_SegmentDelete (segPtr=(struct Vm_Segment *) 0x8023848c, procPtr=(struct Proc_ControlBlock *) 0x0) (vmSeg.c line 792) #6 0x800c1180 in ProcExitProcess (exitProcPtr=(struct Proc_ControlBlock *) 0x80148680, reason=1, status=0, code=0, thisProcess=1) (procExit.c line 605) #7 0x800c0a68 in Proc_ExitInt (reason=1, status=0, code=0) (procExit.c line 270) #8 0x800c0978 in Proc_Exit (status=-1071849236) (procExit.c line 206) #9 0x800336d0 in MachSysCall () (ds5000.md/machAsm.s line 1659) Log-Number: 31490 Subject: missing status messages Date: Tue, 24 Sep 91 22:03:03 PDT From: Mike Kupfer <kupfer> The following Sprite status codes don't have messages for them in lib/c/etc/status.c: RPC_NACK_ERROR RPC_FS_NO_PREFIX I didn't really understand from looking at the sources what these status codes mean, so I didn't put strings in myself. mike P.S. Is there any reason why the tables in status.c have their sizes hardcoded in? Why not use sizeof to figure out how big the various arrays are? Log-Number: 31492 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Wed, 25 Sep 1991 10:56:29 PDT Subject: dump doesn't deal with crashes The dump program doesn't deal with crashes very gracefully. 
The new tape drives do not support update-in-place so our old method of doing things won't work. The new drives have a feature to skip to the end of data. Unfortunately, EOD is a mark written on the tape when you reposition the tape after a write, or write a filemark (I think). If you crash during a dump, EOD is not written, so a subsequent skip-to-EOD will fail when it hits blank tape. In this case dump should back up a file and continue dumping. Unfortunately there are many other reasons why you might not find the EOD mark (like a media error), in which case dump should just bail out. Right now all the dump program will get back from the kernel is DEV_HARD_ERROR if EOD is not found, so dump can't differentiate between the two cases. Currently dump will just bail out. This means that you must put a new tape in the drive if the machine with the tape drive crashes during the dump. John Log-Number: 31499 Date: Thu, 26 Sep 91 00:13:02 PDT From: shirriff@ginger.Berkeley.EDU (Ken Shirriff) Subject: LfsSetSegUsage crash Allspice crashed after cleaning /swap1 with LfsSetSegUsage called on a clean segment. We put a core in allspice.lfssetsegusage and continued it. It seems to work, although it printed a warning about numbytes = -4006. Log-Number: 31500 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Thu, 26 Sep 1991 00:15:22 PDT Subject: Re: LfsSetSegUsage crash The segment in question was #731. And the warning message was that activeBytes was -4006. John Log-Number: 31505 Date: Thu, 26 Sep 91 18:33:51 PDT From: pmchen (Peter M. Chen) Subject: lpd startup problems When I reboot my machine (mustard, ds5000) with the "new" kernel, lpd dies with a bad TLB fault. This is repeatable. Pete Log-Number: 31507 Date: Thu, 26 Sep 91 19:34:06 PDT From: pmchen (Peter M. Chen) Subject: lpd startup problems It still doesn't work. 
I tried restarting it using lpc restart pulla (I did this as root) and got <51>Sep 26 19:32:28 lpd[92c3d]: Lock error, pid 0x45118 Bad user TLB fault in process 92c3d: pc=0 addr=0 <51>Sep 26 19:32:28 lpd[62c34]: Lock error, pid 0x45114 Bad user TLB fault in process 62c34: pc=0 addr=0 Also, lpc complained with: /hosts/mustard.Berkeley.EDU/dev/printer: no such file or directory Pete Log-Number: 31508 Subject: Re: lpd startup problems Date: Thu, 26 Sep 91 20:14:16 PDT From: Mike Kupfer <kupfer> Hmm. Well, from looking at the lpd sources, what's happening is that lpd is trying to acquire the lock file for the printer queue. It fails and tries to verify that the daemon holding the lock is still around. Well, the pid that it's reading from the lock file is for a process on hoot (even though this is all happening on mustard), so of course it can't find that process. It complains ("Lock error...") and forces a segmentation violation by trying to jump to location 0. Maybe there's a bug in file locking in the new kernel...? mike Log-Number: 31511 From: mendel (Mendel Rosenblum) Subject: raid module not installed Date: Fri, 27 Sep 91 11:30:02 PDT The raid module was not install during the last kernel install so the newly built sun4 kernel (1.100) uses an old version of it. This means that the raid module being used on raid1 is not "clean". Hopefully this inconsistency will not trash anything. Mendel Log-Number: 31517 Date: Sun, 29 Sep 91 16:05:44 -0700 From: dlong@cse.ucsc.edu Subject: initsprite staying around initsprite seems to stay around if bootcmds exits normally, but if bootcmds exits abnormally, initsprite goes away. This is on a sun4c. dl Log-Number: 31522 Date: Tue, 1 Oct 91 12:40:09 PDT From: margo (Margo Seltzer) Subject: Disk space disappearing /postdev is an old sprite file system. 
It shows the following disk utilization: Prefix Server KBytes Used Avail % Used /postdev piracy 309808 278827 0 100% However, a du from /postdev shows: piracy.Berkeley.EDU [523]: du 8 ./lost+found 168951 ./margo 168971 . piracy.Berkeley.EDU [523]: and an ls shows: piracy.Berkeley.EDU [526]: ls -sR total 9 8 lost+found/ 1 margo/ lost+found: total 0 margo: total 168950 1 RESULTS 108 teller 6 txnerror.log 159404 account 1 testit* 7732 txnlog 48 branch 1 testit.out 1648 history 1 tp1@ I believe that this occurs when processes with mapped files exit unexpectedly, but I haven't seen a correlation between the amount of space missing and the maximum size of the shared regions. - M Log-Number: 31538 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Thu, 3 Oct 1991 12:17:15 PDT Subject: rpcstat -trace broken rpcstat -trace produces gibberish. I remember agreeing to leave the rpc tracing in the kernel. John Log-Number: 31544 From: rab (Robert A. Bruce) Subject: Re: rpcstat -trace broken Date: Thu, 03 Oct 91 18:05:30 PDT > rpcstat -trace produces gibberish. I remember agreeing to leave the > rpc tracing in the kernel. The rpc tracing is still in the kernel, but it is off by default. If you want to use it, you need to set rpc_Tracing to TRUE. -bob Log-Number: 31541 From: mendel (Mendel Rosenblum) Subject: Select problem? Date: Thu, 03 Oct 91 16:34:56 PDT ------- Forwarded Message Return-Path: margo Received: by sprite.Berkeley.EDU (5.59/1.29) id AA539695; Thu, 3 Oct 91 15:57:56 PDT Date: Thu, 3 Oct 91 15:57:56 PDT >From: margo (Margo Seltzer) Message-Id: <9110032257.AA539695@sprite.Berkeley.EDU> To: mendel@sprite.Berkeley.EDU Subject: piracy hanging Mendel, ..... I just tried connecting to piracy with Kdbx and got a series of: Fs_SocketStub: open failure 40001: sendto (ioctl) messages. I'm rebooting at the moment. Any suggestions? - - M PS: It's running the latest kernel. 
------- End of Forwarded Message Log-Number: 31547 From: mgbaker (Mary Gray Baker) Subject: Larceny died with sched_Mutex deadlock Date: Fri, 04 Oct 91 11:33:13 PDT Last night larceny died with a deadlock on sched_Mutex. One of the processes involved was a killdebug, and the other was a make process that was trying to go into the debugger, I think. I couldn't look at it for long since people were using the machine, and they were desperately trying to finish a class programming assignment. Stack trace of process 1: Mach_ContextSwitch Sched_LockAndSwitch Mach_UserAction MachReturnFromTrap Stack trace of process 2: Mach_ContextSwitch Sched_ContextSwitch Proc_SuspendProcess Sig_Handle MachUserAction MachReturnFromTrap Stack trace of the kernel: Sync_SlowBroadcast RpcClientDispatch Net_Input NetLERecvProcess NetLEIntr MachHandleInterrupt MachContextSwitch2 Sched_ContextSwitchInt Sched_ContextSwitch Proc_SuspendProcess Sig_Handle Mary Log-Number: 31548 From: rab (Robert A. Bruce) Subject: fatal error in FsCacheFileBlocks Date: Fri, 04 Oct 91 15:34:55 PDT Covet crashed with Fatal error: FsCacheFileBlocks, bad block It is still in the debugger if anyone wants to look at it. -bob Log-Number: 31552 Date: Sat, 5 Oct 91 15:46:32 PDT From: mendel (Mendel Rosenblum) Subject: Adduser from uid database doesn't set password Running the adduser script and specifying the account info be taken from the UCB uid database causes the account to be created with a password entry of "*". This means that account can not be used until someone with root access sets a usable password for the account. If the uid database returns a password of "*" the adduser script should prompt for an initial password. Mendel Log-Number: 31554 From: mendel (Mendel Rosenblum) Subject: Long-running program crashes ds5000 Date: Sun, 06 Oct 91 14:32:27 PDT I started three long-running simulations on 3 decStations 5000 (loiter, hijack, and arson) yesterday. 
When I came in this morning all three machines were in a frozen state. There
are no error messages on the console and the machines are not responding to
"kmsg". The simulation does use floating point. The runs were using
different algorithms and random number seeds. The number of simulation steps
(disk writes) varied widely from 140 to 614 million before the machine
froze. Did we ever fix the problem with long-running floating point
simulations like this one and Pete Chen's?

Mendel

ps. This same program has been running for the last week on sparcStations
with zero problems.

Log-Number: 31569
From: mendel (Mendel Rosenblum)
Subject: DecStation freeze up problem
Date: Tue, 08 Oct 91 20:06:34 PDT

John Hartman and I discovered the cause of the decStations freezing up while
running long-running simulations. The problem stems from the mips assembly
language having two add instructions: "add" and "add unsigned". The
difference between the instructions is that the "add" instruction generates
an overflow trap if a two's-complement overflow occurs. In the assembly
language written for the decStation port, the code mainly uses the "add"
rather than the "add unsigned" instruction.

This machine crash was caused by the "add" instruction used to increase the
tlb miss counters. When the long running simulation has generated more than
2^31 tlb faults the "add" instruction that increments the counter traps. A
trap inside the tlb handler is bad news because even the kernel uses the tlb
reload routine. Sprite goes into an infinite loop taking exceptions because
the tlb handler quits working.

A ds5000 running Sprite is capable of generating about 600,000 tlb misses a
second. At this rate the kernel lasts about an hour before death. My log
wrap simulator generates around 200,000 tlb misses a second so it lasts
around 3 hours. This also implies that the simulator is spending 1/3 of its
time in the tlb reload routine.

I haven't fixed anything.
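
[A small worked check of the arithmetic in the message above, as a sketch
only. The names seconds_until_overflow and bump_unsigned are invented here
and are not Sprite code. The first function divides 2^31 by a miss rate to
estimate how long a counter that starts near zero survives under a trapping
signed add. The second uses C unsigned arithmetic as an analogy for "addu":
it wraps silently instead of trapping. -ed.]

```c
#include <assert.h>
#include <stdint.h>

/*
 * Roughly 2^31 increments take a counter that starts at zero past the
 * largest positive 32-bit two's-complement value.
 */
#define SIGNED_32_LIMIT 2147483648.0

/* Estimated seconds of uptime before a signed 32-bit miss counter overflows. */
double seconds_until_overflow(double misses_per_second)
{
    assert(misses_per_second > 0.0);
    return SIGNED_32_LIMIT / misses_per_second;
}

/*
 * Analogy for "addu": unsigned 32-bit arithmetic wraps modulo 2^32 and
 * never raises an overflow exception.
 */
uint32_t bump_unsigned(uint32_t counter)
{
    return counter + 1u;
}
```

At 600,000 misses a second the estimate is about 3,579 seconds (just under
an hour), and at 200,000 misses a second about 10,737 seconds (just under
three hours), which agrees with the figures quoted in the message.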
We need to go thru the kernel and fix all the "add" instructions that should
be "addu". This problem can also occur for addresses because decStation
kernel addresses have the top bit set.

Mendel

Log-Number: 31555
Date: Sun, 6 Oct 91 16:44:17 PDT
From: shirriff (Ken Shirriff)
Subject: File system deadlock

I looked at a file system deadlock Margo is experiencing using big shared
memory files, but I don't know what the solution is:

Her program tries to page in a page from a shared file "buf.shared", but
there are no free pages. So it has the cacheInfoPtr->lock held for
buf.shared, and is waiting for a clean block.

Meanwhile, there are 3 processes which are trying to page out dirty pages
from buf.shared, but they have to get the cacheInfoPtr lock on buf.shared
before they can continue.

So the file is locked until there is a free cache block, but no blocks can
be freed until the file is unlocked.

Any ideas on how to solve this?

Ken

Log-Number: 31556
From: mendel (Mendel Rosenblum)
Subject: rlogind infinite loop
Date: Sun, 06 Oct 91 18:12:33 PDT

I've noticed rlogind going into an infinite loop on decStations. I debugged
one on subversion and found the following problem:

The program reads a 4 byte buffer off the net containing the characters
0xff, 0xff, 0xff, 0xff. Since this starts with the magic characters, rlogind
assumes that it is a special command and calls the control() routine.
control() ignores the sequence because it is not long enough to be a
command. Because control() doesn't advance the buffer to note that the
characters are processed, rlogind goes into an infinite loop calling
control() with the same buffer.

There is a comment in the code that might explain the problem:

    /*
     * Scan over input data looking for control requests
     * (which are preceded by "magic" characters).  Send normal
     * data to the terminal driver, and control info to a
     * special procedure for handling.  By the way, the code below
     * is gross, since it won't work if the control information
     * happens to span a buffer boundary (but if it's good enough
     * for UNIX, then I suppose it's good enough for Sprite).
     */

The rlogin is coming from annex1.berkeley.edu, which is some kind of
terminal server. It might be that this sends different control sequences
than a normal unix rlogin process. Another possibility is that some changes
to the compatibility code caused this problem. I've only seen it on
decStations.

Mendel

Log-Number: 31558
Date: Sun, 6 Oct 91 19:48:39 PDT
From: mendel (Mendel Rosenblum)
Subject: Re: rlogind infinite loop

I've fixed the bug that caused the infinite loop in rlogind and reinstalled
rlogind on all machine types. The bug was introduced when rlogind was ported
to Sprite. Its behavior now on Sprite is the same as on Unix. This still
leaves the question of whether the rlogin is really sending four bytes of
0xff characters or whether something in sprite, the ipServer, or the compat
library is doing something wrong.

Mendel

Log-Number: 31559
Subject: pmake didn't compile
Date: Sun, 06 Oct 91 21:23:15 PDT
From: Mike Kupfer <kupfer>

There was some code in pmake that didn't compile on Sage because on suns it
expected "struct direct" to have a member "d_fileno". Neither the pmake
dir.c nor the system dir.h has changed recently, so I don't know how this
used to compile.

I changed "ifdef sun" to "if defined(sun) && !defined(sprite)" since
(1) Sprite uses different member names and (2) with the Sprite libc the
check inside the ifdef isn't needed anyway. I'm not real happy with this
solution, so if someone would like to suggest something cleaner, I'm all
ears.
mike Log-Number: 31561 Date: Mon, 7 Oct 91 11:49:47 PDT From: schauser (Klaus Erik Schauser) Subject: Latex on Cardamom does not work anymore cardamom:/pcs/schauser> latex Called Sprite syscall *** compat: Invalid message # for Gen module: status = 0x16 This is Common TeX, Version 2.9 (no format preloaded) (Fatal format file error; I'm stymied) cardamom:/pcs/schauser> Log-Number: 31563 Date: Mon, 7 Oct 91 13:32:53 PDT From: shirriff (Ken Shirriff) Subject: latex in compatibility mode Jim determined that latex doesn't run from the sww on Sprite because it tries to access /usr/sww/lib/tex/inputs/article.sty. However, inputs is a link to ../../share/lib/TeX/inputs. The problem is that under Unix, the sww main directory is mounted as /usr/sww, while the sww share directory is mounted as /usr/sww/share. That is, one NFS file system is mounted as a subdirectory of another NFS file system. Unfortunately, I don't think we can do this with Sprite's nfsmount, since we use remote links instead of a mount table. Any ideas on what to do about this? How hard would it be to modify nfsmount to permit this? Could it be done some way that's not totally gross? Ken Log-Number: 31568 Date: Tue, 8 Oct 91 17:48:37 PDT From: shirriff (Ken Shirriff) Subject: makedepend too slow (whining) Running "pmake dependall" on the kernel takes literally hours to run. After 2 hours, I'm about 1/3 of the way through. Log-Number: 31570 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Wed, 9 Oct 1991 11:09:19 PDT Subject: bug in VmMach_TLBFault The routine VmMach_TLBFault for the ds5000 has a bug in which under some conditions it returns FALSE if the address is invalid. Unfortunately SUCCESS == FALSE and FAILURE == TRUE, so the higher-level code thinks the TLB fault was handled correctly. If the problem happens in a user-level program the process will loop forever. If it happens in the kernel the kernel will loop forever. What is the state of the kernel install? 
Should this fix wait for the next kernel? John Log-Number: 31572 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Wed, 9 Oct 1991 21:31:22 PDT Subject: compat bug with grep? I put /sprite/cmds.compat in my path as suggested. I've found that if I try to redirect the output from /sprite/cmds.compat/grep into a file the file ends up empty. I'm running this on a ds5000, and /sprite/cmds.compat appears in my path before /sprite/cmds. John Log-Number: 31576 Date: Wed, 9 Oct 91 23:44:48 PDT From: shirriff (Ken Shirriff) Subject: Re: compat bug with grep? I had a bug in the new compat library for the ds3100; it wasn't doing a fflush on exit. I don't know what other compat programs are affected; they should probably all be recompiled sometime anyways, since I made a fair number of changes the last couple days. Ken Log-Number: 31577 Date: Thu, 10 Oct 91 08:20:22 PDT From: ouster (John Ousterhout) Subject: Lust crash When I came in this morning Lust was in the debugger. The last error message on the console was about a disk error on SCSI #0, Target 4, info bytes "0x0 0x5 0xe4 0xa". I tried rebooting "mgbaker" but there was no such kernel on Lust's disk so I rebooted with "new". -John- Log-Number: 31579 Date: Fri, 11 Oct 91 08:36:50 PDT From: bmiller (Bob Miller) Subject: Allspice down this morning When I came in this morning, allspice was down... Warning: SCSI Disk SCSI3#0 Target 1 LUN 0 error: media error - info bytes 0x0 0x6 0x7d 0xad Entering debugger with a Interrupt Trap (16) exception at PC 0xf60c0eec core dump is in 'vmcore.allspice.crach.11oct' Log-Number: 31581 Subject: bug w/ Xcfbpmax and pseudo devices? Date: Fri, 11 Oct 91 13:56:08 PDT From: Mike Kupfer <kupfer> I ran into a problem where I'd be sitting in front of piracy, start xmh on sage, and see sage% Warning: translation table syntax error: Unknown event type : B Warning: ... 
found while parsing '<Btn1Down>,<B'

The complaint has to do with processing the resources from my .Xdefaults, not from setting up the default xmh resources. My "reply" and "compose" buttons don't get installed correctly. I've tried a variety of machines, versions of xmh, and kernels. It looks like there's a strange interaction between the X server and pseudo-devices. (Note that arson-arson is okay, arson-piracy is bad normally, but arson-piracy works when DISPLAY is set so that the communication has to go through the IP server.) By the way, I couldn't try an old Xcfbpmax, because the version in /X11/R4/cmds.ds3100.old is only about an hour older than the installed version, and it exhibits the same problem.

mike
----------
X server            xmh on              result
--------            ------              ------
sage (DL.245)       sage                OK
piracy (MARGO.9)    piracy              OK
sage                piracy              OK
piracy              sage                bad
piracy              sabotage (1.100)    bad
piracy              coons (1.100)       bad
arson (1.100)       sage                bad
arson               piracy              bad
arson (1.099)       sage                bad
arson (1.099)       piracy (1.099)      bad (using old xmh, too)
arson (1.099)       arson (1.099)       OK
arson               piracy              OK (with DISPLAY set to arson.BERKELEY.EDU:0)

Log-Number: 31588
Subject: hung RPCs to raid1
Date: Sat, 12 Oct 91 22:25:48 PDT
From: Mike Kupfer <kupfer>

I've been having problems this evening with processes hanging. In Sage's syslog I'd see

    RpcDoCall: <io control> RPC to raid1 is hung

followed after a long pause by

    <io control> RPC ok

I thought that raid1 might be getting hung up cleaning, so I looked in its syslog.
I found the following suspicious looking lines (Sage is client 33): ClientCommand, delete msg to client 33 file "llib-lc.ln" <1,149314> failed 40012 Client state killed: 0 refs 0 write 0 exec ClientCommand, delete msg to client 33 file "libc.a.new~" <1,234145> failed 40012 Client state killed: 0 refs 0 write 0 exec ClientCommand, delete msg to client 33 file "lint" <1,78070> failed 40012 Client state killed: 0 refs 0 write 0 exec ClientCommand, delete msg to client 33 file "psh.o" <1,78072> failed 40012 Client state killed: 0 refs 0 write 0 exec ClientCommand, delete msg to client 33 file "psh" <1,78074> failed 40012 Client state killed: 0 refs 0 write 0 exec ClientCommand, delete msg to client 33 file "libc.a.new~" <1,234144> failed 40012 Client state killed: 0 refs 0 write 0 exec These are in fact some of the files I was trying to update. Anyone know what's going on here? (Raid1 is running the 1.100 kernel, and Sage is running DL.245.) mike Log-Number: 31590 Date: Mon, 14 Oct 91 15:50:54 PDT From: root (The Sprite God) Subject: raid1 crash Raid1 crashed with: LfsError: on /r3 status 0x50003 Can't write segment to log. The kgcore image is in /tmp/vmcore.raid1.elm1. I rebooted the machine. ethan Log-Number: 31591 Date: Tue, 15 Oct 91 10:36:19 PDT From: pmchen (Peter M. Chen) Subject: nfsmount of ginger All the nfsmount of the ginger file systems were dead, so I restarted them (as in /hosts/lust/nfs). This was on lust. Pete Log-Number: 31593 Date: Wed, 16 Oct 91 09:59:02 PDT From: pmchen (Peter M. Chen) Subject: sprite crash yesterday Yesterday around 4:30pm, allspice went into the debugger. I think the message was LfsCleanSegment, but I'm not sure. It wouldn't respond to Break-A, or anything else, so we watchdog reset it and rebooted with "new". It fsck'ed, then rebooted automatically. As it came up, the disk "/allspiceA" hung the SCSI bus (disk light remained on) and allspice went into an infinite loop complaining about the scsi bus. 
I watchdog-reset it again and booted "sprite". Pete Log-Number: 31594 Date: Wed, 16 Oct 91 11:12:46 PDT From: pmchen (Peter M. Chen) Subject: allspice write-back failed: out of disk space RmtFile "/sprite/spool/mail/mgbaker" <10,2382> These messages have been coming up, every 30 seconds for the hour. Dunno if the "send-mail -i sprite-users" process in the DEBUG state on allspice has anything to do with this. Pete ------------------------------------------------------------ mustard% df /sprite/spool/mail/mgbaker Prefix Server KBytes Used Avail % Used / allspice 495968 423364 23007 94% ------------------------------------------------------------ mustard% ls -l /sprite/spool/mail/mgbaker -rw------- 1 mgbaker 89428 Oct 16 10:05 /sprite/spool/mail/mgbaker ------------------------------------------------------------ mustard% stat /sprite/spool/mail/mgbaker --rw------- 1 ID=(1471,155) 89428 bytes /sprite/spool/mail/mgbaker Server Domain File # 14 10 2382 Version 8399 UserType 0x0 Created: Apr 8 11:53:13 1991 Data modified: Oct 16 10:05:30 1991 Descr. modified: Oct 16 11:09:43 1991 Last accessed: Oct 13 13:53:53 1991 ------------------------------------------------------------ allspice% ps -au USER PID %CPU %MEM SIZE RSS STATE TIME PR COMMAND root 30e38 3.2 0.4 640 576 RWAIT 9:23 /sprite/daemons/ipServer root a0e46 1.3 0.1 176 176 READY 0:02 rlogind alc f0e44 0.7 0.2 248 248 RWAIT 0:09 mail pmchen e0e48 0.5 0.2 272 272 WAIT 0:02 -csh pmchen 60e59 0.5 0.1 240 176 RUN 0:00 ps -au root 60e3b 0.0 0.0 88 48 RWAIT 0:10 /sprite/daemons/arpd root 60e39 0.0 0.1 216 120 RWAIT 0:01 /sprite/daemons/lpd root 50e3a 0.0 0.2 320 216 RWAIT 0:10 sendmail -bd -q15m root 30e3c 0.0 0.1 144 136 RWAIT 0:03 /sprite/daemons/inetd ... root 50e27 0.0 0.1 280 168 RWAIT 0:01 /sprite/daemons/migd -D ... root 70e3d 0.0 0.2 240 224 RWAIT 0:07 -csh root 10e40 0.0 0.1 120 112 RWAIT 0:01 /sprite/daemons/tftpd root 30e41 0.0 0.1 176 96 WAIT 0:00 /sprite/cmds.$MACHINE/lo... 
root b0e42 0.0 0.1 168 168 WAIT 0:00 rlogind root 10e17 0.0 0.1 240 72 WAIT 0:03 csh ... root 80e37 0.0 0.2 224 208 WAIT 0:14 /sprite/daemons/lpd root 50e3f 0.0 0.1 120 104 WAIT 0:02 /sprite/daemons/cron root 70e47 0.0 0.1 168 160 WAIT 0:00 login -h ... root 30e45 0.0 0.1 88 72 RWAIT 0:55 /sprite/daemons/mopd ddgarcia 40e49 0.0 0.1 104 104 RWAIT 0:00 more inbox/21 root 40e4a 0.0 0.1 184 184 RWAIT 0:00 telnetd root 20e4b 0.0 0.1 160 136 RWAIT 0:00 /sprite/daemons/bootp root 10e4c 0.0 0.1 384 152 DEBUG 0:05 send-mail -i sprite-users root 30e4d 0.0 0.0 72 48 RWAIT 0:02 newtee -inputFile ... root 30e4e 0.0 0.0 72 48 RWAIT 0:01 newtee -inputFile ... root 60e4f 0.0 0.2 200 200 RWAIT 0:01 telnetd root d0e50 0.0 0.1 176 168 WAIT 0:00 /sprite/cmds.$MACHINE/lo... ddgarcia a0e53 0.0 0.2 248 248 WAIT 0:02 -csh root 10e0e 0.0 0.0 96 0 WAIT 0:00 cmds/initsprite -b ... root 40e5b 0.0 0.1 168 168 RWAIT 0:01 rlogind root f0e5c 0.0 0.1 168 160 WAIT 0:00 login -h ... alc f0e5d 0.0 0.2 248 248 WAIT 0:01 -csh Log-Number: 31595 Date: Wed, 16 Oct 91 16:18:11 PDT From: pmchen (Peter M. Chen) Subject: Re: allspice write-back failed: out of disk space I rebooted mustard and the problem went away. Pete Log-Number: 31596 Date: Wed, 16 Oct 91 16:22:05 PDT From: pmchen (Peter M. Chen) Subject: Re: allspice write-back failed: out of disk space After I sent out the previous message, the same error message appeared over and over: 10/16/91 16:21:41 allspice (14) RmtFile "/sprite/spool/mail/mgbaker" <10,2382> Write-back failed: out of disk space<40008> It seemed like the mail to bugs triggered this error. Pete Log-Number: 31611 Date: Fri, 18 Oct 91 03:33:13 PDT From: eklee (Edward K. Lee) Subject: Close call with raid1 We almost lost /r3 on raid1 today. The problem started with a consistent read error for a certain file descriptor. (I'm not sure exactly what the problem was; Mike can explain it better.) 
Mike and I further investigated the problem and discovered that one of the disks on which /r3 is built was generating intermittent hw read errors. We were able to copy the disk by reissuing read requests until they completed successfully (over a hundred read errors occurred while copying the disk). Most likely, the data on the disk was OK but the electronics are bad. Please check your files and let me know if any of them are corrupted. As a note, at one point, Mike and I considered restoring the filesystem from tape but the most recent dump was from Saturday. It appears that dumps are still failing to complete.

Ed

Log-Number: 31612
Date: Fri, 18 Oct 91 08:46:45 PDT
From: ouster (John Ousterhout)
Subject: Allspice crash and bogus bootcmds

When I came in this morning Allspice was not responding to client requests (e.g. Tyranny was continually broadcasting for /swap1 and receiving no response) although it appeared alive upstairs. There were many messages on the console of the form

    Fsrmt_RpcRead, no handle, ...
    FsrmtFileVerify, no handle, ...

I rebooted it, but the other machines were still unable to connect to /swap1. Also, the inetd on Allspice died with select errors. Eventually I figured out that /swap1 wasn't mounted on Allspice. I read through /hosts/allspice/bootcmds and discovered that the mount command for /swap1 was commented out! I mounted /swap1 by hand and things appear OK now. However, at some point in there when things were catatonic I also rebooted Lust.

Action items:
1. Why is the mount line for /swap1 commented out in /hosts/allspice/bootcmds? Doesn't this need to get fixed ASAP?
2. The 1.101 kernel seems to be having continued problems with select. This also needs to get fixed ASAP or else let's back out to 1.100 again.

-John-

Log-Number: 31613
From: jhh@sprite.Berkeley.EDU (John H.
Hartman) Date: Fri, 18 Oct 1991 10:34:17 PDT Subject: Re: Allspice crash and bogus bootcmds /swap1 was unmounted by tve due to problems with it while we were at SOSP. Mendel and I put things back together but we forgot to uncomment the line in /hosts/allspice/bootcmds. It is now fixed. I think the "dalmation" kernel is a real dog and we should back out to 1.100. John Log-Number: 31615 Date: Fri, 18 Oct 91 13:30:52 PDT From: ss@joyride.Berkeley.EDU (Srinivasan Seshan) Subject: nfsmounting sprite drives on sunos I can no longer mount the sprite file systems on my sunOS machine. I get the following error: Oct 17 13:08:31 joyride automount[97]: lust: exports: RPC: Timed out Srini Seshan Log-Number: 31617 From: mgbaker (Mary Gray Baker) Subject: terrorism pmeg free list munged Date: Fri, 18 Oct 91 14:15:04 PDT Terrorism died running the 1.099 kernel in VmCheckListIntegrity while checking a pmegPtr that had a pagecount of 0 and thus was thought to be on the free pmeg list. But it wasn't really on the list and appeared to have been removed already. 
#0 panic (__builtin_va_alist=-166886998) (sysPrintf.c line 220)
#1 0xf60d82b0 in VmCheckListIntegrity (listHdr=(struct List_Links *) 0xf617aff8) (vmSubr.c line 1499)
#2 0xf60c7d40 in VmMach_PageValidate (virtAddrPtr=(struct Vm_VirtAddr *) 0xf8101db0, pte=3917487601) (sun4c2.md/vmSun.c line 3466)
#3 0xf60d0374 in VmPageValidateInt (virtAddrPtr=(struct Vm_VirtAddr *) 0xf8101db0, ptePtr=(unsigned int *) 0xf64b7550) (vmPage.c line 654)
#4 0xf60d1c84 in FinishPage (transVirtAddrPtr=(struct Vm_VirtAddr *) 0xf8101db0, ptePtr=(unsigned int *) 0xf64b7550) (vmPage.c line 1773)
#5 0xf60d18e4 in Vm_PageIn (virtAddr=(char *) 0x400f0 <Address 0x400f0 out of bounds>, protFault=0) (vmPage.c line 1601)
#6 0xf600e830 in MachPageFault (busErrorReg=32896, addrErrorReg=(char *) 0x400f0 <Address 0x400f0 out of bounds>, trapPsr=285216900, pcValue=(char *) 0x1e184634 "\320%@") (sun4c2.md/machCode.c line 1318)
#7 0xf6012394 in MachHandlePageFault ()
#8 0x1e1844f0 in ?? ()
#9 0x3800 in ?? ()
#10 0x2f14 in ?? ()
#11 0x1e111120 in ?? ()
#12 0x1e11130c in ?? ()
#13 0x1e1095cc in ?? ()
#14 0x1e1095b4 in ?? ()
#15 0x2c48 in ?? ()

Mary

Log-Number: 31618
From: mgbaker (Mary Gray Baker)
Subject: Allspice cleaner ran out of clean segments
Date: Fri, 18 Oct 91 17:22:27 PDT

Allspice crashed with the error:

    Lfs ran out of clean segments during cleaner checkpoint.

The core file is on ginger in /export1/cores/lfs.noCleanSegments.

Mary

Log-Number: 31619
Date: Sun, 20 Oct 91 12:38:33 PDT
From: shirriff (Ken Shirriff)
Subject: Allspice consistency deadlock?

>When I came in this morning clients were waiting for Allspice
>which wasn't doing anything. Allspice's console was alive
>until I did a cd; then it hung and I couldn't do anything.

I looked at the core image and it seems like a client consistency deadlock, not a 103 problem. I couldn't figure out who was waiting on whom from the core, so I've rebooted with 103.
Ken

Log-Number: 31620
Date: Sun, 20 Oct 91 12:41:18 PDT
From: shirriff (Ken Shirriff)
Subject: migd problem

After rebooting allspice, the migd database seemed to be messed up, since 'who' on any machine would result in:

    MigOpenPdev: Error opening pdev /sprite/admin/migd/pdev (still trying): no such file or directory.

I restarted migd on a client and things seemed to straighten themselves out after a few minutes.

Ken

Log-Number: 31621
From: mendel (Mendel Rosenblum)
Subject: Problems with reboot this morning
Date: Mon, 21 Oct 91 12:43:27 PDT

We had several problems getting Sprite back up this morning once the power had been restored to the machine room.

1) The prefix command for allspice for /sprite/src/kernel fails because /sprite/src is on lust and the lstat() for /sprite/src/kernel fails when lust is down. Lust can't be booted before allspice because it needs "/" from allspice. This problem accounts for the times that allspice rebooted without mounting /sprite/src/kernel. I mounted /sprite/src/kernel by hand after lust rebooted.

2) The remote link for /tmp was nuked sometime during reboot so the prefix command on lust didn't mount /tmp. I recreated the remote link for /tmp and mounted /tmp by hand on lust.

3) Lust would not reboot with the "compat" kernel. It appears that the Sprite Reverse ARP code failed so Lust couldn't figure out its Sprite ID number. It worked ok with the new kernel. I suspect that the Sprite Reverse ARP code doesn't work between machines of different machine types. I booted the "new" kernel and it worked.

4) Raid1 would not boot with the "compat" kernel. It appears that LFS could not access the raid (/dev/raid3) file systems. The read of the label off /dev/raid3 returned zeros for the label. Future reads considered the partition to have size zero. I booted the "new" kernel and it worked.
Mendel

Log-Number: 31623
Date: Tue, 22 Oct 91 09:00:38 PDT
From: mottsmth (Jim Mott-Smith)
Subject: Lust died with List_Remove error

When I came in this morning Lust was dead with the message:

    List_Remove: item's pointers are invalid

It had apparently gone through recovery with Allspice recently but I don't know if this is relevant. I couldn't get a core dump so I rebooted Lust with 1.099.

-- Jim M-S

Log-Number: 31624
From: mendel (Mendel Rosenblum)
Subject: raid module not rebuilt for kernel install
Date: Tue, 22 Oct 91 10:18:21 PDT

This is the second kernel install that didn't install the raid module. Is there something we can do to promote the raid module to first class status?

Mendel

------- Forwarded Message

Return-Path: dlong@cats.UCSC.EDU
Received: from cats.UCSC.EDU by sprite.Berkeley.EDU (5.59/1.29)
	id AA593489; Mon, 21 Oct 91 18:49:26 PDT
Received: from am.UCSC.EDU by cats.UCSC.EDU with SMTP
	id AA05834; Mon, 21 Oct 91 18:47:28 -0700
>From: dlong@cats.UCSC.EDU
Received: by am.ucsc.edu (5.65/4.7)
	id AA07435; Mon, 21 Oct 91 18:47:27 -0700
Message-Id: <9110220147.AA07435@am.ucsc.edu>
To: mendel@sprite.Berkeley.EDU (Mendel Rosenblum)
Subject: Re: Problems with reboot this morning
In-Reply-To: Your message of Mon, 21 Oct 91 12:43:27 -0700.
	<9110211943.AA594486@sprite.Berkeley.EDU>
Date: Mon, 21 Oct 91 18:47:26 +45722824

> 4) Raid1 would not boot with the "compat" kernel. It appears that LFS
> could not access the raid (/dev/raid3) file systems. The read of
> the label off /dev/raid3 returned zeros for the label. Future reads
> considered the partition to have size zero. I booted the "new"
> kernel and it worked.
>
> Mendel

This could be because the raid module was not rebuilt for the compat kernel. /sprite/src/kernel/sun4.md/raid.o is dated Oct 2.
dl

------- End of Forwarded Message

Log-Number: 31625
Date: Tue, 22 Oct 91 11:45:25 PDT
From: shirriff (Ken Shirriff)
Subject: Re: raid module not rebuilt for kernel install

>This is the second kernel install that didn't install the raid module.
>Is there something we can do to promote the raid module to first
>class status?

Maybe updating the howto file so that following it builds all the modules would be a start.

Ken

Log-Number: 31626
Subject: allspice crash
Date: Tue, 22 Oct 91 12:37:39 PDT
From: Mike Kupfer <kupfer>

Somebody left an unsigned note on allspice's console, saying that allspice crashed around 0100 today. Whoever it was took a core file (which Ken is looking at) and rebooted allspice with the 1.099 kernel.

mike

Log-Number: 31627
Date: Tue, 22 Oct 91 12:51:11 PDT
From: shirriff (Ken Shirriff)
Subject: Re: allspice crash

Allspice crashed doing a read because the fileLinks list for a cache block was bad. I couldn't tell if the cache block being processed had a bad link or if the cache block it pointed to had been overwritten. My guess is that something is trashing memory.

Log-Number: 31629
From: tve@crackle.Berkeley.EDU (Thorsten von Eicken)
Subject: Re: allspice crash
Date: Tue, 22 Oct 91 14:14:43 PDT

At about the time allspice crashed, I was mounting and unmounting filesystems. At one point I did a "prefix -U" without "prefix -d" beforehand. Dunno if that caused troubles.

TvE

Log-Number: 31631
Date: Tue, 22 Oct 91 14:41:16 PDT
From: ouster (John Ousterhout)
Subject: Allspice IP server dead

Can someone restart it? Thanks.

-John-

Log-Number: 31632
Subject: mustard died with "bad stream type"
Date: Wed, 23 Oct 91 12:35:23 PDT
From: Mike Kupfer <kupfer>

Mustard, which is Peter Chen's ds5000, died with

    Fatal Error: Fs_RetSegPtr [sic], bad stream type 1830844532

It was running the 1.099 kernel. However, when I looked at mustard with kgdb, it reported a perfectly normal stream type of 1.
I looked briefly at the assembly code for Fs_GetSegPtr, decided it was hopeless, and rebooted mustard. mike Log-Number: 31634 Date: Wed, 23 Oct 91 13:05:19 PDT From: shirriff (Ken Shirriff) Subject: Block cleaner killed sassafras Sassafras died running the 1.103 kernel due to a bad backend pointer passed to the block cleaner FsrmtCleanBlocks. (It had the address 0x4000c0, which didn't point to valid memory.) Since this came from a Proc_ServerProc, I couldn't tell who gave it this pointer originally. Also, FsrmtCleanBlocks either has a bug or misleading comments: FsrmtCleanBlocks(data, callInfoPtr) ClientData data; /* Background flag. If TRUE it means * we are called from a block cleaner * process. Otherwise we being called * synchrounously during a shutdown */ ... backendPtr = (Fscache_Backend *) data; Note that the ClientData, which is claimed to be a boolean flag, is cast to a structure pointer. Ken Log-Number: 31635 Subject: Re: bug w/ Xcfbpmax and pseudo devices? Date: Wed, 23 Oct 91 17:29:29 PDT From: Mike Kupfer <kupfer> I'm still seeing the sage% Warning: translation table syntax error: Unknown event type : B Warning: ... found while parsing '<Btn1Down>,<B' problems when I run X on arson, rlogin to sage, then start up xmh. Both arson and sage are running 1.103. Going in the reverse direction (run X on sage, rlogin to arson, start up xmh) works fine. My suspicion is that the server is doing a partial read and getting back "B" when it should get back "Btn1Up". I tried rebuilding Xcfbpmax from scratch, but that didn't fix the problem. The current Xsun and Xcfbpmax were both installed at the same time, so I imagine there's some kernel difference that's provoking the problem. mike Log-Number: 31637 Subject: allspice crash: LfsSetSegUsage on clean segment Date: Wed, 23 Oct 91 21:15:51 PDT From: Mike Kupfer <kupfer> Allspice has been getting soundly thrashed, I think by Peter Chen's I/O benchmark program ("adaptWl"), which is running on sedition. 
Whatever the problem program is, it's causing /swap1 to get cleaned very frequently, which causes the rest of Sprite to get stuck for painfully long periods of time. Anyway, at one point allspice got really stuck. L1-t showed that the timer queue had gotten wedged. I tried L1-a and then continue, but a couple seconds after that allspice went into the debugger with LfsSetSegUsage called on a clean segment (740). Jim claimed that this was a known problem, so I booted the compat kernel (1.103) without taking a core file.

mike

Log-Number: 31638
Date: Thu, 24 Oct 91 08:31:53 PDT
From: ouster (John Ousterhout)
Subject: Allspice reboot

Allspice was not responding to clients when I came in this morning. Nothing unusual appeared on its console except for a few messages about "spurious interrupts". Since there was nothing obviously wrong, I didn't take a core dump, but I rebooted and it cleared up all the clients.

-John-

Log-Number: 31639
Date: Thu, 24 Oct 91 09:52:20 PDT
From: ouster (John Ousterhout)
Subject: New crypt broken

A new "crypt" was installed in /sprite/cmds.sun4 by dlong on October 7, but it appears to be broken (at least for me: I get messages like "crypt: cannot generate key" when I attempt to use it). I've overwritten it with the copy saved in /sprite/cmds.sun4.old.

-John-

Log-Number: 31642
From: dlong@cats.UCSC.EDU
Date: Thu, 24 Oct 91 11:25:40 -0700
Subject: Re: New crypt broken

The old crypt gave errors similar to "unaligned heap, recompile", so I did. I'm not sure why the recompiled version was broken.

dl

Log-Number: 31641
From: mendel (Mendel Rosenblum)
Subject: sedition ran amuck
Date: Thu, 24 Oct 91 10:14:24 PDT

Pete Chen's program adaptWl on sedition started paging heavily and wouldn't respond to any signals, including SIGKILL. I ran "kmsg -d sedition" but couldn't attach the debugger to it. The debugger error message was "Timing out and resending to host sedition."
Mendel Log-Number: 31645 Subject: /sprite/cmds.compat/at broken Date: Sat, 26 Oct 91 11:44:59 PDT From: Mike Kupfer <kupfer> Running the 1.105 kernel... sage% which at /sprite/cmds.compat/at sage% date Sat Oct 26 11:41:31 PDT 1991 sage% at 1145 /usr/spool/at/91.298.1145.09: no such file or directory Note that the spool directory does exist, and there don't appear to be any permissions problems. sage% ls -ldg /usr/spool/at drwxr-xr-x 3 root wheel 512 Oct 25 11:13 /usr/spool/at/ sage% ls -l /sprite/cmds.compat/at -rwsrwxr-x 1 root 65536 Oct 7 03:17 /sprite/cmds.compat/at* mike Log-Number: 31646 Date: Sat, 26 Oct 91 12:20:36 PDT From: mottsmth (Jim Mott-Smith) Subject: 'at' doesn't work r.e.: Mike's discovery that 'at' fails with /usr/spool/at/91.298.1145.09: no such file or directory It seems that /sprite/cmds.compat/at fails on 1.103 as well. Also, /sprite/cmds/at works fine on both 1.103 and 1.105. Therefore, I hereby exonerate 1.105 of all slanderous accusations. :-) -- Jim M-S Log-Number: 31647 Date: Sun, 27 Oct 91 17:08:41 PST From: shirriff (Ken Shirriff) Subject: ds3100 wedgeup running 1.105 I tried to run my name cache simulator on my ds3100 running the 105 kernel. After a minute the machine wedged up and didn't respond to anything, including L1-A, so I couldn't debug. The problem doesn't seem to be repeatable. (I guess this bug report will go straight to the ignore pile.) Ken Log-Number: 31648 Date: Sun, 27 Oct 91 17:30:37 PST From: shirriff (Ken Shirriff) Subject: ds3100 wedgeup The previously mentioned ds3100 wedgeup is repeatable (much to my dismay), and also occurs with the 1.098 kernel. Ken Log-Number: 31649 Date: Mon, 28 Oct 91 11:10:04 PST From: shirriff (Ken Shirriff) Subject: Kernel deadlock in profile code This time my simulations deadlocked on the ds5000. I was running a profiled simulator. 
The problem apparently is that the process received a profiling timer interrupt during a fork while it was in Prof_Disable and had the profilLock monitor lock held. The timer interrupt called the profiler and it deadlocked since the lock was already held. Then the other processes (sendmail, cron, csh, etc.) stacked up waiting for the lock.

Ken

Log-Number: 31651
From: mendel (Mendel Rosenblum)
Subject: Re: Allspice refusing FTP
Date: Mon, 28 Oct 91 13:13:40 PST

> Subject: Allspice refusing FTP
>
> Allspice is refusing attempts to FTP to it. Is this an indication
> that the ipServer needs to be restarted, or is it a bug in the
> new kernel?
> -John-

The problem is that the inetd doesn't like an error returned by an accept() system call and it shuts down the ftp service. The error appears in allspice's syslog:

    <28>Oct 24 13:14:35 inetd[30e4d]: ftp/tcp accept: invalid argument

My guess at the problem is that the accept() stub requires two communications with the ipServer. The first is an ioctl() that returns a handle for a new connection. The stub then creates a new connection with the ipServer and returns the handle to the ipServer with an ioctl(). I suspect that the client creating the connection is going away between the two ioctl()'s, causing the second one to return GEN_INVALID_ARG. Accept in Unix doesn't normally return such an error so Unix programs such as inetd close the socket if it occurs. A better thing to do would be to have the accept() return SUCCESS and let the future access to the socket fail. In the meantime we could also patch inetd to do the right thing. This was already done for the accept in sendmail.

Mendel

Log-Number: 31656
From: mgbaker (Mary Gray Baker)
Subject: Tyranny hung up migration
Date: Tue, 29 Oct 91 14:36:30 PST

I put tyranny in the debugger because it hung up migration. It hasn't been rebooted since the troubles with /swap1 and so forth this morning so that's probably the problem.

Mary

Log-Number: 31657
From: rab (Robert A.
Bruce)
Subject: bmiller's note
Date: Tue, 29 Oct 91 14:41:04 PST

Bob Miller left the following note on allspice's console:

----------------------------------------------------------
When I came in:
allspice console scrolling "can't fetch handle for file 66371 for cleaning."
reset + rebooted "new", which proceeded until:
"rsd14a: "/"
rsd00a: 304 files, 7927 blocks in use, 39937 blocksfree, 256 fragments
/swap1: cleaning started - deficient 226 segs
Fatal error: LfsError on /swap1 status 0x1 Bad descriptor magic number
Entering debugger with a interrupt type (16) exception at pc 0xf60cd754
Took dump -> on vmcore.magicnum.oct29
reset + rebooted again on "new", but got same error message (dump not taken)
Tried rebooting "sprite", but got same error
----------------------------------------------------------

-bob

Log-Number: 31658
Date: Tue, 29 Oct 91 15:27:35 -0800
From: soumen@cory.berkeley.edu (CHAKRABARTI SOUMEN)
Subject: password

my password is 11 chars long and after typing only 5 and accidentally pressing <CR> i was duly logged in. started sending mail and in the middle of it was suddenly logged out. could you look into the matter please?

Log-Number: 31662
Subject: ds3100 cc bug? "operands of : have incompatible types"
Date: Tue, 29 Oct 91 22:57:33 PST
From: Mike Kupfer <kupfer>

I tried building migcom (the guts of the Mach MIG compiler) on a DECstation and got

    ccom: Error: server.c, line 1368: operands of : have incompatible types
          (IsKernelServer ? WriteTypeDeclOut : WriteTypeDeclIn),
          ------------------------------------------------------------^
    *** Error code 1

WriteTypeDeclOut and WriteTypeDeclIn are both functions, declared as

    extern void WriteTypeDeclIn(/* FILE *file, argument_t *arg */);
    extern void WriteTypeDeclOut(/* FILE *file, argument_t *arg */);

If I cast the two functions to (int *), as in

    (IsKernelServer ? (int *)WriteTypeDeclOut : (int *)WriteTypeDeclIn),

the compiler takes the statement without complaint.
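For comparison, a small sketch of the same construct as ISO C treats it: when both operands of the conditional operator are functions of the same type, the result is a pointer to that type. The names below (decl_writer, write_decl_in, write_decl_out, pick_writer, pick_writer_cast) are illustrative stand-ins rather than the MIG sources, and the argument type is simplified to a string. If a cast is needed as a workaround, casting to the function-pointer type is likely the safer form, since ISO C does not guarantee conversions between function pointers and object pointers such as (int *).

```c
#include <stdio.h>

/* Hypothetical common type for the two writer functions. */
typedef void (*decl_writer)(FILE *file, const char *arg);

/* Stand-ins for the two writer functions; not the MIG implementations. */
void
write_decl_in(FILE *file, const char *arg)
{
    fprintf(file, "in: %s\n", arg);
}

void
write_decl_out(FILE *file, const char *arg)
{
    fprintf(file, "out: %s\n", arg);
}

/* Both operands have the same function type, so ?: yields a pointer to it. */
decl_writer
pick_writer(int isKernelServer)
{
    return isKernelServer ? write_decl_out : write_decl_in;
}

/* Workaround shape: cast to the function-pointer type, not to (int *). */
decl_writer
pick_writer_cast(int isKernelServer)
{
    return isKernelServer ? (decl_writer)write_decl_out
                          : (decl_writer)write_decl_in;
}
```

A caller would use it as pick_writer(flag)(stdout, "arg"), which keeps the selection and the call in one expression as in the original statement.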
Have I actually found a bug in the MIPS C compiler? mike Log-Number: 31663 Date: Wed, 30 Oct 91 08:23:45 PST From: ouster (John Ousterhout) Subject: restartIPServer vs. nfsmount Do the NFS daemons all get restarted correctly when "restartIPServer" is invoked on lust? I needed to reboot tyranny when I came in this morning, but it wouldn't reboot (hung silently). I figured the boot daemons must be wedged. I couldn't remember which daemons were on Allspice and which were on Lust, so I typed "restartIPServer" on both machines. Allspice restarted fine, but a bunch of error messages appeared on Lust's console and it appeared that the nfsmount's hadn't been able to restart (e.g. "df" hung). I was in a hurry and didn't have time to poke around, so I just rebooted Lust. -John- Log-Number: 31664 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Wed, 30 Oct 1991 11:09:11 PST Subject: migration broken Migration is having problems on the decstations. I've been running mkmf on the decstations and sometimes the migrated makedepends fail with JobFlagForMigration: warning: eviction of process b5323 apparently did not complete. This happens when the makedepend is evicted. I've also encountered problems with "Error 1" and "Error 16", although I haven't been able to figure out what causes them. Right now I'm running the 1.105 kernel, although the problems predate the kernel. John Log-Number: 31665 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Wed, 30 Oct 1991 11:10:58 PST Subject: pmake broken Pmake's children are often left in the SUSP state indefinitely. I think pmake is suspending any children that cannot be remigrated, but then it never continues them. John Log-Number: 31666 From: mendel (Mendel Rosenblum) Subject: Last writer still not reset Date: Wed, 30 Oct 91 11:56:56 PST We are still getting messages of the form: ClientCommand, write-back msg to client 83 file ",RCSnew004360" <2,149278> failed 40012 in allspice's syslog. 
This happens when allspice does a callback to flush dirty blocks when the client has already recycled the handle for the file. The bug is that the lastWriter field of the file handle on allspice is only updated when the last block of the file is written back because of the delayed writeback. The CLOSE RPC currently doesn't inform the server when the client has no dirty blocks left for the file. This means that any file that does not have delayed writeback of blocks will get an unnecessary callback to flush blocks. Any zero length file, file open for writing and not written, or fsync'ed file will have the problem. The fix is to modify the routine FsrmtFileClose() in fsrmtFile.c to send the FS_LAST_DIRTY_BLOCK flag on the close if there are no dirty blocks in cache. The number of dirty blocks is returned by the call to Fscache_PreventWriteBacks() and this info should be used to get the flag to Fsrmt_Close() correctly. The code on the server side of things appears to be correct. I will make this fix when the CVS conversion is done.

Mendel

Log-Number: 31669
Date: Wed, 30 Oct 91 14:06:34 PST
From: shirriff@ginger.Berkeley.EDU (Ken Shirriff)
Subject: allspice crash: unaligned addr.

Allspice crashed with:

    Unaligned address trap in kernel: procPTr = f6684128, pc = f6053344

Log-Number: 31672
From: mgbaker (Mary Gray Baker)
Subject: Cross compilation error messages not clear
Date: Thu, 31 Oct 91 14:28:56 PST

Soumen was logged onto a sun3 and tried to run the sun4 linker since he had only run mkmf for a sun4. I have explained the problem to him, so I think no further immediate action is necessary, but maybe a clearer error message could be generated for cross-compilation problems?
------- Forwarded Message

Return-Path: soumen
Received: by sprite.Berkeley.EDU (5.59/1.29)
    id AA407065; Thu, 31 Oct 91 14:08:35 PST
Date: Thu, 31 Oct 91 14:08:35 PST
>From: soumen (SOUMEN CHAKRABARTI)
Message-Id: <9110312208.AA407065@sprite.Berkeley.EDU>
To: mgbaker@sprite.Berkeley.EDU
Subject: install

hi mary,
last time i managed to get the skeleton running in my dir, today without any change pmake is failing with

    cc; installation problem , cannot exec ld.sun4 : ....

this is very exasperating to get held up by such snags. could you please look into the matter?

soumen.

------- End of Forwarded Message

Log-Number: 31673
Subject: Re: install
Date: Thu, 31 Oct 91 17:43:35 EST
From: Fred Douglis <douglis@MITL.COM>

This follows on to Mary's comment about the cross-compilation environment not generating clear messages. A message like "wrong type of executable" would be so much better than "permission denied"....

Fred

Log-Number: 31680
Subject: a.out's for other machine types
Date: Thu, 31 Oct 91 17:46:42 PST
From: Mike Kupfer <kupfer>

Actually, Bob was in some sense right when he said

> there is no errno the kernel can pass that means
> (wrong type of executable).

csh apparently interprets ENOEXEC as meaning "not an executable file", so it tries to treat the file as a shell script. This can lead to unedifying error messages like

    foo: 1: Syntax error: "(" unexpected

if the file is a binary of some sort.

According to the RCS log for procExec.c, Ken put in code to check for this condition, so that the shell wouldn't try to run the binary as a script. The relevant kernel code (in both DoExec and SetupInterpret) is

    if (ProcGetObjInfo(filePtr, (ProcExecHeader *)buffer, &objInfo)
            != SUCCESS) {
        if (ProcIsObj(filePtr,1) == SUCCESS) {
            status = FS_NO_ACCESS;
        } else {
            status = PROC_BAD_AOUT_FORMAT;
        }
    }

The ProcIsObj call returns SUCCESS if the file is any sort of object file, even one for a different machine type. Maybe GEN_EINVAL would be better than FS_NO_ACCESS?
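
The three cases discussed in this thread (native object file, object file for another machine type, not an object file at all) can be sketched in isolation. This is a minimal illustration only, not the Sprite kernel code: the status names, the two-byte magic numbers, and the functions ClassifyHeader() and ExecStatusMessage() are all hypothetical, invented to show how a distinct "wrong machine" status could carry the clearer message Fred suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical status codes; these are not the Sprite status values. */
typedef enum {
    EXEC_OK,             /* native object file, safe to exec */
    EXEC_WRONG_MACHINE,  /* object file, but for another machine type */
    EXEC_NOT_OBJECT      /* not an object file; might be a shell script */
} ExecStatus;

/* Hypothetical magic numbers, invented for this sketch. */
#define MAGIC_NATIVE  0x0107u
#define MAGIC_FOREIGN 0x0162u

/*
 * Classify a file by the first two bytes of its header.
 * A file too short to hold a magic number is treated as a non-object.
 */
ExecStatus ClassifyHeader(const unsigned char *buf, size_t len)
{
    unsigned magic;

    assert(buf != NULL);
    if (len < 2) {
        return EXEC_NOT_OBJECT;
    }
    magic = ((unsigned)buf[0] << 8) | buf[1];
    if (magic == MAGIC_NATIVE) {
        return EXEC_OK;
    }
    if (magic == MAGIC_FOREIGN) {
        return EXEC_WRONG_MACHINE;
    }
    return EXEC_NOT_OBJECT;
}

/* Map each status to a message a user could act on. */
const char *ExecStatusMessage(ExecStatus status)
{
    switch (status) {
    case EXEC_OK:
        return "ok";
    case EXEC_WRONG_MACHINE:
        return "wrong type of executable";
    default:
        return "not an executable file";
    }
}
```

Keeping the "wrong machine" case separate from both "permission denied" and "not an executable" is the point of the sketch: a shell only needs to fall back to script interpretation in the last case.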
mike Log-Number: 31674 Subject: gdb on ds5000 can't print past 15 Date: Thu, 31 Oct 91 16:12:34 PST From: Mike Kupfer <kupfer> This is on piracy, running gdb 3.5. (gdb) print /x 4097 (gdb) print /x 16 (gdb) print 16 (gdb) print 2+2 $4 = 4 (gdb) print /x 4097+0 (gdb) print 2 $6 = 2 (gdb) print 16 (gdb) print 15 $8 = 15 (gdb) print 15+1 (gdb) (gdb) print 16-1 $11 = 15 (gdb) print 17-1 (gdb) Log-Number: 31681 Date: Fri, 1 Nov 91 08:21:02 PST From: pmchen (Peter M. Chen) Subject: pmake hangs When I do a pmake, it hangs indefinitely (as with my bug report a couple days ago). This time it had migrated to coons. Is this a known bug? Until it's fixed, we could turn off migration on those machines that hang pmake. Pete ps. This was on mustard, a ds5000 running 1.105. I ran pmake in ~pmchen/adaptWl Log-Number: 31682 Subject: sun3 CFLAGS Date: Fri, 01 Nov 91 15:28:27 PST From: Mike Kupfer <kupfer> The default CFLAGS when the target is a sun3 is currently just "-msun3". Does anyone know why it isn't -msun3 -Dsun3 -Dsprite This would be consistent with all the other machine types. mike Log-Number: 31683 Subject: CFLAGS passed to lint Date: Fri, 01 Nov 91 15:34:07 PST From: Mike Kupfer <kupfer> -D and -I flags are passed to lint, but -U flags (e.g., -Uultrix) get filtered out. Is this deliberate? mike Log-Number: 31685 From: mendel (Mendel Rosenblum) Subject: Re: raid1 crash: jaguar command queue full Date: Sat, 02 Nov 91 16:48:55 PST > Subject: raid1 crash: jaguar command queue full > Date: Fri, 01 Nov 91 23:17:26 PST > From: Mike Kupfer <kupfer> > > Raid1 crashed this evening. There were a bunch of messages on the > console about failed write-backs (mostly .pcf files to client 73), then > > /r3: Cleaning started - deficit 229 segs > Fatal Error: Jaguar3: Command Queue full > > Raid1 was running the 1.099 kernel. I put a core file in > /home/ginger/cores/raid1.jaguarfull and rebooted with the 1.105 kernel. > > mike The problem is that raid1 overran a jaguar board and panic'ed. 
Some event, such as a cleaning starting or someone typing sync on the console, caused 16 SCSI commands to be stuffed into the board in rapid succession. The code sends at most two commands per device, and there are 8 devices attached to the board, so up to 16 commands can be outstanding. The host puts each command in a circular command buffer, and the board takes the command from the buffer and queues it internally for the device. There are 8 command buffers, so raid1 managed to insert 8 commands before the board removed one. This is the first time this has happened.

The easiest fix would be to increase the number of command buffers to the maximum number of commands that will be queued on the board (16 with the current limits). This will stop the buffer from getting overrun.

Mendel

Log-Number: 31688
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Sat, 2 Nov 1991 18:35:41 PST
Subject: crash on COW

The dump program tends to kill hijack if allspice reboots in the middle of it. This is because we don't recover streams correctly, so the streams associated with the swap files for dump are lost. Later, when dump tries to exit, Fs_PageCopy is called to copy any COW pages. Fs_PageCopy would then die because one of the stream pointers was NIL, because of the failed recovery. I fixed Fs_PageCopy to make sure the streams aren't NIL and return FAILURE if they are. The return status only propagates up as far as the COW routine, so it's possible the machine will die a little later, but at least this way we'll get a more meaningful error message on the console.

John

Log-Number: 31690
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Sun, 3 Nov 1991 22:16:35 PST
Subject: allspice's timer stopped again

Allspice stopped getting timer interrupts again. L1-a followed by 'c' fixed the problem. Would it be possible to add a test to the network interrupt handler to check on the health of the timer and restart it if it has stopped? The test could be done at reasonable intervals to avoid slowing down net interrupt handling.
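
The periodic health check proposed above could look roughly like the following. This is a sketch under stated assumptions, not Sprite code: the TimerWatch structure, the TimerWatch_NetInterrupt() routine, and the CHECK_INTERVAL value are all hypothetical. It assumes the timer interrupt handler increments a tick counter, and that the network interrupt handler calls the check routine once per interrupt.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: number of net interrupts between timer health checks. */
#define CHECK_INTERVAL 64

typedef struct {
    unsigned long timerTicks;    /* incremented by the timer interrupt handler */
    unsigned long lastSeenTicks; /* tick count observed at the previous check */
    unsigned netIntrCount;       /* net interrupts since the previous check */
    unsigned restarts;           /* how many times the timer was restarted */
} TimerWatch;

/*
 * Called once per network interrupt. Most calls only bump a counter,
 * so the common path stays cheap. Every CHECK_INTERVAL calls, compare
 * the tick counter with the value seen last time; if it has not moved,
 * treat the timer as stopped. Returns 1 when a restart was triggered.
 */
int TimerWatch_NetInterrupt(TimerWatch *w)
{
    assert(w != NULL);
    if (++w->netIntrCount < CHECK_INTERVAL) {
        return 0;
    }
    w->netIntrCount = 0;
    if (w->timerTicks != w->lastSeenTicks) {
        w->lastSeenTicks = w->timerTicks;   /* timer is alive */
        return 0;
    }
    /* A real kernel would reprogram the timer hardware here. */
    w->restarts++;
    return 1;
}
```

One caveat with this design: a burst of CHECK_INTERVAL network interrupts inside a single timer period would look like a stopped timer, so the interval would have to be chosen well above the worst-case interrupt rate, or the restart would have to be harmless when the timer is in fact running.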
John

Log-Number: 31691
Date: Mon, 4 Nov 91 08:16:37 PST
From: bmiller (Bob Miller)
Subject: Lust rebooted

I came in this morning and found LUST hung. The console messages were:

    Fatal Error: LE ethernet: can not output first packet on restart
    Entering debugger with a breakpoint trap exception at PC0x800e7aac

Bob

Log-Number: 31696
From: mendel (Mendel Rosenblum)
Subject: Re: gdb sun4 bug?
Date: Mon, 04 Nov 91 18:16:23 PST

>
> I've been groveling over the gdb sources, to get gdb working on
> anarchy, and grep pointed out the following code in
> /sprite/src/cmds/gdb/sun4.md/m-sparc.h:
>                                                 \
>     if (TYPE_CODE (TYPE) = TYPE_CODE_FLT)       \
>
> Shouldn't that be "== TYPE_CODE_FLT", not "= TYPE_CODE_FLT"?
>
> mike

Yes, you are right. This has been fixed in gdb 4.0, which is installed on Sprite under the name gdb.new.

Mendel

Log-Number: 31697
Date: Tue, 5 Nov 91 16:00:56 PST
From: pmchen (Peter M. Chen)
Subject: fscmd -M

Works fine on non-fileservers. But when I use it on mustard (ds5000 using 1.105), which has a scratch disk, it doesn't seem to work. E.g.

    mustard% fscmd -M 1000
    mustard% fsstat
    Block Cache, 9.04 Mbytes
    BLOCKS 2315 min 32 max 1000(6417) free 2313

Note how there are 2315 blocks in the block cache, even though the max is 1000. Then I unmounted the scratch disk (/user4/pmchen/t1) and tried again.

    mustard% fscmd -M 1000
    mustard% fsstat
    Block Cache, 3.91 Mbytes
    BLOCKS 1000 min 32 max 1000(6417) free 991

Pete

Log-Number: 31701
From: mendel (Mendel Rosenblum)
Subject: Re: followup message on fscmd -M for fileservers
Date: Tue, 05 Nov 91 16:22:42 PST

> To: bugs
> Subject: followup message on fscmd -M for fileservers
>
> When I remounted the scratch disk (prefix -M /dev/rsd01c -l /user4/pmchen/t1),
> the block cache magically got bigger, back to
>
> mustard-4# fsstat
> Block Cache, 9.92 Mbytes
> BLOCKS 2539 min 32 max 1000(6417) free 2505
>
> Is there any way to limit the file cache size for file servers?
>
> Pete

Looks like the fscmd -M doesn't work.
The problem here is that the LFS storage managers reserve space in the file cache for cleaning and write buffering. It disregards any "-M" option setting. Mendel Log-Number: 31702 Date: Tue, 5 Nov 91 16:26:14 PST From: pmchen (Peter M. Chen) Subject: Re: followup message on fscmd -M for fileservers >The problem here is that the LFS storage managers reserve space in >the file cache for cleaning and write buffering. It disregards any >"-M" option setting. >> mustard-4# fsstat >> Block Cache, 9.92 Mbytes >> BLOCKS 2539 min 32 max 1000(6417) free 2505 That shouldn't affect the amount of file data that can be cached, though, right? So the maximum amount of file data (apart from the LFS cleaning and write buffering) should be 1000 blocks here, right? Pete Log-Number: 31699 From: mgbaker (Mary Gray Baker) Subject: Non-deterministic compile errors Date: Tue, 05 Nov 91 16:14:30 PST Several times in the last few days a module has failed to compile due to a spurious syntax error. By spurious I mean the syntax error doesn't really exist, and recompiling the module right away without modification succeeds. An example just now was my vm module. The error was --- sun4c.md/vmBoot.o --- In file included from /sprite/src/kernel/Include/rpc.h:23, from /sprite/src/kernel/Include/sig.h:19, from /sprite/src/kernel/Include/proc.h:29, from /sprite/src/kernel/Include/sync.h:57, from ./vm.h:27, from vmBoot.c:16: /sprite/src/kernel/Include/net.h:96: parse error before `Afdress' I looked at net.h and there was no such problem and the file was last modified several days ago. I did a !pmake and the compile succeeded. Mary Log-Number: 31700 Subject: Re: Non-deterministic compile errors Date: Tue, 05 Nov 91 16:18:26 PST From: Mike Kupfer <kupfer> Yes, there's something definitely peculiar going on. I ran a bunch of stuff through diff a day or two ago and noticed that random characters in the diff output were replaced by control characters. 
mike Log-Number: 31703 Date: Tue, 5 Nov 91 16:33:42 PST From: pmchen (Peter M. Chen) Subject: lfs disk on mustard After I rebooted mustard and started using the scratch disk, I got the following infinite loop of error messages: LfsSetSegUsage: Warning active bytes for segment 403 is -4096 The warnings started for segment 398 or so. Pete Log-Number: 31705 Subject: MIPS optimizer bug w/ function pointers Date: Tue, 05 Nov 91 17:54:19 PST From: Mike Kupfer <kupfer> If you compile the appended file with "-g3 -O -c", the routine "foo" gets optimized away into foo: [bar.c: 11] 0x24: 03e00008 jr ra [bar.c: 11] 0x28: 00000000 nop (i.e., return immediately). If you take away the (char *) cast, the code is not optimized away. If you're wondering why this isn't a problem with the kernel, most of the routines passed to Proc_NewProc are external, not static, and the optimizer doesn't break external functions that way. Also, (most of) the calls in the DECstation mainInit are protected by a truly disgusting cast: (void) Proc_NewProc((Address)(unsigned)(int (*)())Init, PROC_KERNEL, FALSE, &pid, "Init"); Anyway, this problem seems to have been fixed in the 2.0 compiler, so we might want to think about installing the new compiler some time. mike -- int someInt; static void foo(); void random() { NewProc((char *)foo); } static void foo() { printf("gurgle\n"); } Log-Number: 31710 From: mendel (Mendel Rosenblum) Subject: Hijack crash with 1.105 kernel Date: Thu, 07 Nov 91 09:47:37 PST Hijack died around 15:57 yesterday. Messages on the console said something like: badVaddr 0xc8403ae4 TLB Fault at PC 0x800a2bd8 The machine wouldn't respond to ping, kmsg, or kgdb. I reset it and it was in the dbg module in ReadRequest() waiting for a debugger packet. My guess is that debugger didn't work because the net module quit working. The initial fault occurred in bcopy(). 
The fault address is somewhere in the virtual addresses that would have been allocated to the file cache if the machine had more memory.

Mendel

Log-Number: 31711
From: rab (Robert A. Bruce)
Subject: weird pmake heisenbug
Date: Thu, 07 Nov 91 11:53:51 PST

Every now and then pmake executes the wrong command. Instead of the command it is supposed to execute, it invokes the previous command, but with the new arguments. This is not repeatable but has occurred to me several times in the last few days. Here is an example:

------------------------------------------------------------------
making all in ./demos/xgas...
rm -f xgas
gcc -o xgas main.o dynamics.o chamber.o timestep.o molecule.o util.o help.o XGas.o doc.o quick.o man.o -O ../.././lib/Xaw/libXaw.a ../.././lib/Xmu/libXmu.a ../.././lib/Xt/libXt.a ../.././extensions/lib/libXext.a ../.././lib/X/libX11.a -L/usr/X11R5/lib -lm -B/usr/bin/
rm: illegal option -- o
usage: rm [-rif] file ...
*** Error code 1 (continuing)
------------------------------------------------------------------

The problem is that although pmake claims to be running gcc, it is actually running rm twice, and the second time rm is invoked with the arguments that are supposed to be passed to gcc. When I ran pmake again, it worked fine.

-bob

Log-Number: 31712
Subject: Re: weird pmake heisenbug
Date: Thu, 07 Nov 91 14:55:10 EST
From: Fred Douglis <douglis@MITL.COM>

There have been a few messages about problems along these lines. Rings a bell -- like we used to have problems just like this a long time ago, when the file offset wasn't being managed properly. Has anyone fiddled with the migration and/or file system code recently? Anyway, you might want to check the Sprite log (in fact, the older logs, which may be separate from wherever messages go right now) and see if a solution to this problem was presented 2-3 years ago...

Fred

Log-Number: 31713
Date: Thu, 7 Nov 91 16:43:28 PST
From: mani (Mani Varadarajan)
Subject: bug in ftp?
whenever i try to ftp to cory, it refuses my login. this happens only from sprite machines: (arson) mani % ftp cory Connected to cory.Berkeley.EDU. 220 cory.Berkeley.EDU FTP server (Ultrix Version 4.1 Mon Aug 27 19:11:56 EDT 199 0) ready. Name (cory:mani): 331 Password required for mani. Password: 530 Login incorrect. Login failed. ftp> user (username) mani 331 Password required for mani. Password: 530 Login incorrect. Login failed. ftp> from a non-sprite machine: (villandry) mani % ftp cory Connected to cory.berkeley.edu. 220 cory.Berkeley.EDU FTP server (Ultrix Version 4.1 Mon Aug 27 19:11:56 EDT 199 0) ready. Name (cory:mani): 331 Password required for mani. Password: 230 User mani logged in. ftp> this problem repeats. i can rlogin successfully to cory, however. mani Log-Number: 31721 Date: Sat, 9 Nov 91 10:57:21 PST From: kupfer (Mike Kupfer) Subject: compat mail is broken If I type "/sprite/cmds.compat/mail bugs", I get no response. That is, I don't get prompted for a subject (despite the setting of my .mailrc), ^C kills "mail" the first time (normally the first time just gives a warning), and the ~ escapes don't seem to work. mike Log-Number: 31726 Subject: bogus use(s) of Vm_MakeAccessible Date: Mon, 11 Nov 91 11:48:12 PST From: Mike Kupfer <kupfer> Here's an interesting bug in Test_RpcStub: Rpc_EchoArgs *echoArgsPtr = (Rpc_EchoArgs *)argPtr; Time deltaTime; Vm_MakeAccessible(VM_READONLY_ACCESS, sizeof(Rpc_EchoArgs), (Address) echoArgsPtr, &argSize, (Address *) (&echoArgsPtr)); if (argSize != sizeof(Rpc_EchoArgs)) { return(RPC_INVALID_ARG); } Vm_MakeAccessible(VM_READONLY_ACCESS, echoArgsPtr->size, echoArgsPtr->inDataPtr, &inSize, &echoArgsPtr->inDataPtr); The user program passes in a struct that has information about the RPC test to be done. Test_RpcStub maps the struct into the kernel's address space so that it can safely access the fields of the struct. The next thing it tries to do is map the input buffer into the kernel's address space. 
Unfortunately, it wants to put the kernel address of the buffer in the struct, overwriting the user address. This works in native Sprite on our usual hardware because (1) the user address is already available in the kernel address space (2) the access type argument that is passed to Vm_MakeAccessible is unused (3) on most hardware (i.e., everything except the Sequent), the user address is the same as the kernel address. (Also, callers of Test_RpcStub might not assume that the struct's contents are preserved across the call.) mike Log-Number: 31738 Date: Fri, 15 Nov 91 08:52:05 PST From: pmchen (Peter M. Chen) Subject: lfs, syslog messages I've been pounding on my local lfs disk, and got the following message in my syslog (this was on mustard, ds5000, running 1.105): /user4/pmchen/t1: Cleaning started - deficit 7 segs Can't fetch cache block <14018,2443> for cleaning. /user4/pmchen/t1: Cleaned 186 segments in 19 segments /user4/pmchen/t1: Cleaning started - deficit 32 segs /user4/pmchen/t1: Cleaned 35 segments in 3 segments I've seen the "cleaning started...cleaned x segments" before, but never the "Can't fetch cache block" message. Also, it would be nice to have all the messages which go to the syslog timestamped somehow (now, some are and most aren't). That way I can link some anomalies of performance to allspice crashes, lfs cleaning, etc. Can this be done for all syslog messages? Maybe at least for the common ones, such as the cleaning messages and the RpcDoCall: hung...ok. Pete Log-Number: 31745 Date: Fri, 15 Nov 91 20:45:44 -0800 From: clarsen@postgres.Berkeley.EDU (Case Larsen) Subject: ether addr. The ethernet address for babylon changed. Allspice doesn't seem to think so. I'm told it's a bug. Thanks -- Case [21-Nov-91: theoretically this bug has been fixed. Of course, you have to rerun netroute first... -mdk] Log-Number: 31748 Date: Sun, 17 Nov 91 12:25:48 PST From: mottsmth (Jim Mott-Smith) Subject: Timer queue messed up again... 
I just 'L1-a continued' it. -- Jim M-S Log-Number: 31749 Date: Sun, 17 Nov 91 15:12:25 PST From: ouster (John Ousterhout) Subject: Allspice load Allspice's load factor has been going through the roof this afternoon, bouncing from 5 to 1 to 7 again. Does anybody have any idea what could be causing this? -John- Log-Number: 31750 Date: Sun, 17 Nov 91 22:05:16 PST From: mottsmth (Jim Mott-Smith) Subject: Award winning telnetd code... Here's a crufty piece from telnetd.c: Do you trust the 'if' statement? dontoption(option) int option; { char *fmt; switch (option) { case TELOPT_ECHO: /* we should stop echoing */ mode(0, ECHO); fmt = wont; break; default: fmt = wont; break; } if (fmt = wont) { myopts[option] = OPT_NO; } else { myopts[option] = OPT_YES; } (void) sprintf(nfrontp, fmt, option); nfrontp += sizeof (wont) - 2; } We should probably get the latest n' greatest version (which has this procedure removed) and install it. -- Jim M-S Log-Number: 31764 Date: Wed, 20 Nov 91 09:33:49 PST From: elm (ethan miller) Subject: Possible problem with telnetd? Several times I've been disconnected from terrorism while logged in from Supercomputing '91. The message has been "host terminated connection," so it's something terrorism was doing. Are there any known bugs in either telnetd or inetd that could cause this? ethan Log-Number: 31751 Date: Mon, 18 Nov 91 13:58:07 PST From: pmchen (Peter M. Chen) Subject: fscmd -f after a prefix -U I just crashed mustard (ds5000) after unmounting the local disk, deleting the prefix, then flushing the cache with fscmd -f. The error message said something about pointer pointing to an invalid block. I haven't tried to repeat this. This was on 1.106. Shouldn't prefix -U make sure that all dirty blocks are written to disk before unmounting? Pete Log-Number: 31756 Date: Mon, 18 Nov 91 19:57:55 PST From: pfile (Rob Pfile) Subject: sendmail death sendmail seems to have died sometime after 5:30 today. 
Although in the process table it doesn't seem abnormal, test mail to myself is backed up on a non-sprite machine, where the mailq program reports that sprite is down. i'm a gonna restart sendmail.

rob

Log-Number: 31757
Subject: "wrong server ID" RPCs from allspice
Date: Tue, 19 Nov 91 12:21:13 PST
From: Mike Kupfer <kupfer>

I've been seeing in sage's and oregano's syslogs a fair number of messages like

    Warning: Rpc_Dispatch, wrong server ID 18 RPC 2 flags 214 Client 14 at address: 08:00:20:00:05:6d

The server ID and RPC number vary. The flags always seem to be 214 (ACK|SERVER|CLOSE), and the sender is always allspice. These messages are also showing up in raid1's and lust's syslogs (though in lust's case the sender isn't always allspice).

mike

Log-Number: 31758
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Tue, 19 Nov 1991 12:28:23 PST
Subject: Re: "wrong server ID" RPCs from allspice

This is a known bug (e.g. #00671) that has been around for a while. It has something to do with sending ack packets at interrupt level.

John

Log-Number: 31767
Subject: mysterious lust crash
Date: Wed, 20 Nov 91 20:51:57 PST
From: Mike Kupfer <kupfer>

Lust died running the 1.105 kernel. It got itself into an infinite loop of

    LE ethernet: Bogus receive interrupt. Buffer 0xbe804028 owned by chip.
    Entering debugger...
    LE ethernet: Bogus receive interrupt.
    ...

I reset it and booted the 1.106 kernel.

mike

Log-Number: 31768
Subject: "dup2: invalid argument"
Date: Wed, 20 Nov 91 21:06:55 PST
From: Mike Kupfer <kupfer>

With the 1.106 kernel, and perhaps with earlier kernels, "shutdown -S 0" yields a message

    dup2: invalid argument

The system shuts down okay, but I have to wonder what's causing that message.

mike

[5-Dec-1991: this is probably from "shutdown"'s invoking "wall -l". Use "shutdown -q" instead of "shutdown -S 0".
-mdk] Log-Number: 31769 Subject: multiple copies of exit() when linking pmake on ds3100 Date: Thu, 21 Nov 91 21:52:21 PST From: Mike Kupfer <kupfer> When I try to link pmake on a DECstation, I get /usr/lib/libc_g.a(exit.go): exit: multiply defined Sure enough, the pmake main.c defines its own version of exit(). I think that ld wants to load in the libc exit() so it can satisfy some global variables that the libc exit.c defines. (These variables are used by atexit().) Obviously we used to be able to link pmake. I think the problem results from recent changes to the libc exit.c and atexit.c. Maybe we need a separate exitVars.c whose sole purpose in life is to define these global variables. mike P.S. My temporary workaround will be to turn off the exit() in pmake's main.c and just use the libc exit(). Log-Number: 31770 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Fri, 22 Nov 1991 11:05:59 PST Subject: bug in freopen I've fixed a bug in freopen and installed a new C library containing the fix. There was a test for STDIO_NOT_OUR_BUF that was backwards, so that the wrong buffer would get used when the stream was reinitialized. In most cases it worked ok, except if the user had done a setbuf or one of its variants, in which case the buffer would be lost and a standard buffer substituted in its place, or if the user did an freopen of a closed stream, in which case the application would die. Freopen on a closed stream is not an officially supported operation, but some applications do it anyway (I guess it works on Unix). John Log-Number: 31771 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Fri, 22 Nov 1991 11:56:10 PST Subject: libc problem There is something wrong with the installation process for libc. 
I installed a new libc for the sun3 and sun4's, but the resulting library caused the following error in ld: ld: malformatted header of archive member in /sprite/lib/sun4.md/libc.a I had to go to /sprite/lib/sun{4,3}.md and do ar d libc.a __.SYMDEF followed by ranlib to get things to work. John Log-Number: 31777 From: rab (Robert A. Bruce) Subject: greed's local disk Date: Fri, 22 Nov 91 15:04:27 PST Greed's local disk will not work with either the 105th or 106th kernel. The 99th kernel works fine. The new kernels think that /graphics is an lfs filesystem, but I don't think it really is. -bob Log-Number: 31778 Subject: too many revisions in aliases file (RCS bug) Date: Fri, 22 Nov 91 15:07:59 PST From: Mike Kupfer <kupfer> I just checked in revision 1.240 of /sprite/lib/sendmail/aliases. When I attempted to do an rlog (to verify that "deleteuser" was generating correct log messages), I got rlog error, line 1202: Hashtable overflow rlog aborted Other RCS programs (co, rcs) behaved the same way. I renamed the RCS file to aliases,v.tooBig and made a new one. mike Log-Number: 31780 Subject: Fs_UserClose can return wrong value? Date: Fri, 22 Nov 91 21:33:35 PST From: Mike Kupfer <kupfer> Consider the following code from Fs_UserClose: status = Fs_GetStreamPtr(procPtr, streamID, &streamPtr); if (status != SUCCESS) { /* * Fudge the return status. A close() can only return EBADF or * EINTR, so return something that maps to EBADF even if it * doesn't make sense here. Sprite system calls are going * away soon anyway. */ if (status != GEN_ENOENT) { return(FS_NEW_ID_TOO_BIG); } return(status); } It seems to me that this code returns either ENOENT or EBADF, rather than EINTR. I think the test should be if (status != GEN_EINTR) { return(FS_NEW_ID_TOO_BIG); } mike Log-Number: 31782 Date: Sat, 23 Nov 91 12:55:09 PST From: shirriff@ginger.Berkeley.EDU (Ken Shirriff) Subject: Allspice mystery crash Allspice was down this morning. Its console was full of entering debugger messages. 
I couldn't attach the debugger so I reset and rebooted it.

Ken

Log-Number: 31783
Subject: unused arguments in fscacheBlocks.c
Date: Sun, 24 Nov 91 16:43:39 PST
From: Mike Kupfer <kupfer>

Lint complains about unused arguments to assorted routines in the fscache module. Anyone know what the scoop is on these routines (e.g., are the unused arguments no longer needed)?

mike
--
fscacheBlocks.c:
fscacheBlocks.c(1672): warning: argument writeTmpFiles unused in function CacheWriteBack
fscacheBlocks.c(2578): warning: argument onFront unused in function Fscache_ReturnDirtyFile
fscacheBlocks.c(2909): warning: argument backendPtr unused in function Fscache_ReserveBlocks
fscacheBlocks.c(2959): warning: argument backendPtr unused in function Fscache_ReleaseReserveBlocks

Log-Number: 31787
From: mendel (Mendel Rosenblum)
Subject: Re: unused arguments in fscacheBlocks.c
Date: Mon, 25 Nov 91 17:09:04 PST

>
>
> Lint complains about unused arguments to assorted routines in the
> fscache module. Anyone know what the scoop is on these routines
> (e.g., are the unused arguments no longer needed)?
>
> mike
> --

These are caused by several different things.

>
> fscacheBlocks.c(1672): warning: argument writeTmpFiles unused in function CacheWriteBac

This was left over from the many different write policies that Mike Nelson added for his thesis. It could be removed without any problem.

> fscacheBlocks.c(2578): warning: argument onFront unused in function Fscache_ReturnDirtyFile

I don't know about this one. I suspect that it was designed for future expansion that never happened.

> fscacheBlocks.c(2909): warning: argument backendPtr unused in function Fscache_ReserveBlocks
> fscacheBlocks.c(2959): warning: argument backendPtr unused in function Fscache_ReleaseReserveBlocks

These arguments are included for consistency with other calls. That is, all calls of this form take the backendPtr as the first argument even if the current implementation doesn't use it.
Mendel Log-Number: 31784 Date: Mon, 25 Nov 91 14:53:48 PST From: shirriff (Ken Shirriff) Subject: Lust crash - ethernet reset Lust crashed with the following: resetting network interface PMAD-AA Deferring reset LE ethernet: transmit buffer owned by chip Fatal Error: Can not output first packet on restart. Log-Number: 31785 Date: Mon, 25 Nov 91 16:00:04 PST From: elm (ethan miller) Subject: problems with sending mail from emacs on ds3100 Sending mail from heresy (a ds3100) to elm or elm@ginger results in a Bad user TLB fault in process e4898: pc=463754 addr=646e6553 This bug is repeatable. It happens when an emacs on heresy tries to send mail (use C-x m to get a mail buffer, and then send it off). The addresses elm or elm@ginger will cause this error. I don't know about other ones. The mail sent in this way never reaches either recipient or (as an error message) sender. ethan Log-Number: 31788 Subject: Re: problems with sending mail from emacs on ds3100 Date: Mon, 25 Nov 91 22:44:24 PST From: Mike Kupfer <kupfer> I think the problem has something to do with vfork and the stack. Emacs alloca's an argument array, vforks, and passes the array to the child, which is supposed to exec sendmail. The string addresses in the parent are new_argv[0] = (0x854a20) `/usr/lib/sendmail' new_argv[1] = (0x891210) `-oi' new_argv[2] = (0x891218) `-t' new_argv[3] = (0x891234) `-oem' new_argv[4] = (0x891240) `-odb' The addresses in the child are (gdb) print new_argv[0] $1 = 0x646e6553<Address 0x646e6553 out of bounds> (gdb) print new_argv[1] $2 = 0x2e676e69<Address 0x2e676e69 out of bounds> (gdb) print new_argv[2] $3 = 0x6f642e2e<Address 0x6f642e2e out of bounds> (gdb) print new_argv[3] $4 = 0x656e<Address 0x656e out of bounds> (gdb) print new_argv[4] $5 = 0x891240 "-odb" As a workaround, I'll hack Emacs to use fork() instead of vfork(). 
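
A fork()-based version of the "build an argument array, then exec" pattern might look like the sketch below. It is illustrative only and is not the Emacs source; SpawnAndWait() is a hypothetical helper. With fork() the child runs on its own copy of the parent's address space, so an argument array built on the parent's stack (for example with alloca) is still intact when the child reaches execv().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run the program at `path` with the NULL-terminated argument vector
 * `argv` and wait for it to finish. Returns the child's exit status,
 * or -1 if the child could not be created, could not be waited for,
 * or did not exit normally. Hypothetical helper for illustration.
 */
int SpawnAndWait(const char *path, char *const argv[])
{
    pid_t pid;
    int status;

    assert(path != NULL && argv != NULL);

    pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        /* Child: argv was copied along with the rest of the address space. */
        execv(path, argv);
        _exit(127);     /* only reached if execv failed */
    }

    /* Parent: wait for the child, retrying if a signal interrupts the wait. */
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) {
            return -1;
        }
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The cost of this choice is the extra address-space copy that vfork() was meant to avoid, which is why it reads as a workaround rather than a fix for the underlying kernel problem.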
mike Log-Number: 31789 Subject: more mips compiler flakiness (whining) Date: Tue, 26 Nov 91 16:56:08 PST From: Mike Kupfer <kupfer> Compiled with -g3 -O, the following lines of code (the starting line number is 276) if (!migrated) { ProcSetupEnviron(procPtr); } Fs_InheritState(parentProcPtr, procPtr); get compiled into [procFork.c: 276] 0x330: 8fa90020 lw t1,32(sp) [procFork.c: 276] 0x334: 00000000 nop [procFork.c: 276] 0x338: 15200003 bne t1,zero,0x348 [procFork.c: 276] 0x33c: 00000000 nop [procFork.c: 277] 0x340: 0c000000 jal ProcSetupEnviron [procFork.c: 277] 0x344: 02002021 move a0,s0 [procFork.c: 285] 0x348: 8e020014 lw v0,20(s0) [procFork.c: 285] 0x34c: 00000000 nop [procFork.c: 285] 0x350: 30420002 andi v0,v0,0x2 [procFork.c: 285] 0x354: 10400008 beq v0,zero,0x378 [procFork.c: 285] 0x358: 8faa0024 lw t2,36(sp) [procFork.c: 286] 0x35c: 8fa40024 lw a0,36(sp) [procFork.c: 286] 0x360: 0c000000 jal Fs_InheritState [procFork.c: 286] 0x364: 02002821 move a1,s0 Note the funky test before the call to Fs_InheritState. With the optimizer off we get [procFork.c: 276] 0x474: 8faf0020 lw t7,32(sp) [procFork.c: 276] 0x478: 00000000 nop [procFork.c: 276] 0x47c: 15e00004 bne t7,zero,0x490 [procFork.c: 276] 0x480: 00000000 nop [procFork.c: 277] 0x484: 8fa40028 lw a0,40(sp) [procFork.c: 277] 0x488: 0c000000 jal ProcSetupEnviron [procFork.c: 277] 0x48c: 00000000 nop [procFork.c: 280] 0x490: 8fa40024 lw a0,36(sp) [procFork.c: 280] 0x494: 8fa50028 lw a1,40(sp) [procFork.c: 280] 0x498: 0c000000 jal Fs_InheritState [procFork.c: 280] 0x49c: 00000000 nop Can we puleeeze puleeeze get the new MIPS compiler installed soon? mike Log-Number: 31790 From: rab (Robert A. Bruce) Subject: allspice hung Date: Tue, 26 Nov 91 22:45:19 PST Allspice hung about an hour ago. I took a core dump and rebooted. -bob Log-Number: 31791 Date: Wed, 27 Nov 91 12:17:36 PST From: pmchen (Peter M. 
Chen)
Subject: /dev/null write permissions wrong

When I logged on (mustard, ds5000, 1.106), I found /dev/null to look like:

    crw-r--r-- 1 root 6, 0 Nov 27 12:13 /dev/null

So I chmod'ed it on allspice to

    crw-rw-rw- 1 root 6, 0 Nov 27 12:16 /dev/null

Pete

Log-Number: 31793
Date: Wed, 27 Nov 91 13:42:38 PST
From: tve (Thorsten von Eicken)
Subject: tar s option doesn't seem to work

The tar man page claims that the s option can be used to strip leading slashes off pathnames. I'm restoring files from an "absolute" tar archive and can't get tar to put the files in the current directory (which is what the s option should do, right?). Bug?

TvE

Log-Number: 31795
From: rab (Robert A. Bruce)
Subject: Re: tar s option doesn't seem to work
Date: Wed, 27 Nov 91 15:21:34 PST

> I think the man page is for the GNU tar, which is installed as tar.gnu.

Um. Well, actually the situation is more complicated than that. Both tar and tar.gnu are gnu tar. Tar is an old version, and tar.gnu is the newer version. It was installed as tar.gnu until we were sure there were no problems with it, and no one ever went back and renamed it. The man page is for the old unix tar, so a lot of the things in it are wrong. I have been meaning to write a man page for gnu tar, but I forgot about it till now.

Anyway, if you use tar.gnu, it will strip off leading /'s by default. You have to give it an option to make it use absolute path names.

    -P, +absolute-paths     don't strip leading "/"es from file names

You can get a complete set of options by typing `` tar.gnu +help ''.

I will install tar.gnu as the default tar and make sure dump and restore are set up to exec the correct one.

-bob

Log-Number: 31797
Date: Thu, 28 Nov 91 17:08:35 PST
From: pmchen (Peter M. Chen)
Subject: stickiness lost

I have a couple sticky files which, every so often, seem to become unsticky. This was on mustard, ds5000, running 1.106, but it's happened using other kernels.
This was the file stat after it lost its stickiness

mustard% ls -l migDisallow
-rwx--x--t  1 root  32 Nov 25 16:07 migDisallow*

This was the file stat before (I think) it lost its stickiness

mustard-3# chmod 7711 !$
chmod 7711 mig*
mustard-4# !ls
ls -l mig*
-rws--s--t  1 root  31 Nov 25 16:07 migAllow*
-rws--s--t  1 root  32 Nov 25 16:07 migDisallow*

This is in ~pmchen/bench/adaptWl.

Pete

Log-Number: 31800
Subject: Re: stickiness lost
Date: Sat, 30 Nov 91 12:29:28 PST
From: Mike Kupfer <kupfer>

"s" means setuid/setgid, not "sticky". The "t" indicates stickiness.
There might be a cron job that is turning off the setuid/setgid bits,
though I didn't see it when I looked in the system crontabs.

Having setuid shell scripts is a major security hole, though I suppose
it's not any worse than some of the other Sprite security holes. Still, I
don't see what you need those scripts for, or at least I don't see why
they have to be setuid to root. (Nor does there seem to be much point in
making them sticky, come to think of it.)

mike

Log-Number: 31801
Subject: Re: stickiness lost - oops
Date: Sat, 30 Nov 91 13:57:14 PST
From: Mike Kupfer <kupfer>

Oops, for my previous message I tried running "migcmd -I" to see whether
you had to be root, and I think I accidentally ran it from a root shell.

There are ways to conveniently enable/disable migration that are safer
than what you currently have. One is to write a setuid C program that
invokes migcmd via the system() library routine. Unfortunately, this
isn't completely straightforward, because system() relies on having a
reasonable path set up. A second option is to take away all access to
"other" (i.e., mode 4750) for your scripts. (Or at least I think this
will be secure.)

mike

Log-Number: 31798
Date: Fri, 29 Nov 91 17:05:05 PST
From: shirriff (Ken Shirriff)
Subject: /scratch1 unmounted

I unmounted /scratch1 because it is failing. Allspice crashed twice due
to hardware errors on /scratch1.

Log-Number: 31804
From: rab (Robert A.
Bruce) Subject: LfsSetSegUsage fatal error Date: Mon, 02 Dec 91 00:18:56 PST Lust crashed. Fatal error: LfsSetSegUsage bad sement number 776541 I tried to take a core dump from ginger but it didn't work. Is kgcore supposed to work with decStations? -bob Log-Number: 31805 From: rab (Robert A. Bruce) Subject: cleaning loop Date: Mon, 02 Dec 91 02:32:13 PST Allspice got stuck in a loop while cleaning /user2. There was a long stream of messages like this: /user2: Cleaning started -- deficit 32 segs /user2: Cleaned 6174 segments in 6890 /user2: Cleaning started -- deficit 32 segs /user2: Cleaned 6178 segments in 6895 /user2: Cleaning started -- deficit 32 segs /user2: Cleaned 6182 segments in 6900 /user2: Cleaning started -- deficit 32 segs /user2: Cleaned 6186 segments in 6905 .... Each time the first number increased by 4 and the second number increased by 5. I tried to put allspice in the debugger. I hit break-W to sync the disks and then break-D. But allspice hung while syncing the disks and I had to hit the watchdog reset. -bob Log-Number: 31806 Date: Mon, 2 Dec 91 08:34:04 PST From: ouster (John Ousterhout) Subject: Allspice hung this morning When I came in Allspice wasn't responding to clients, but seemed OK from the console. I L1-A'ed and continued it, and then everyone recovered, except tyranny and sedition. Both of these machines went into infinite recovery loops doing recovery, then producing a console message on Allspice something like "FsRmtClientVerify: no such handle ...", then recovering again. I rebooted both of these machines remotely. -John- Log-Number: 31807 From: culler (David Culler) Subject: Booting a DS3100 Date: Mon, 02 Dec 91 10:22:55 PST In trying to boot Cardamom this morning (Turkey day had sent it into the debugger) I tried boot -f mop()new It hung after printing: configuring cache boot -f mop() worked fine. D. Log-Number: 31809 From: jhh@sprite.Berkeley.EDU (John H. 
Hartman)
Date: Mon, 2 Dec 1991 14:19:34 PST
Subject: makedepend of kernel sources

Symbolic links have recently been added to /sprite/src/kernel/Include of
the form mach.h -> $MACHINE.md/mach.h. Unfortunately when makedepend is
run in the kernel sources it often comes up with the wrong mach.h file.
This happens if I run makedepend for a different machine type than my
current host. The -I options to makedepend tell it to look in the machine
dependent directories first, so the symbolic link should never be seen.
Somehow makedepend is looking in the machine independent directory first
and getting the wrong mach.h. This happens in the dev module and the vm
module. Try running "pmake depend TM=sun4" while logged on to a ds3100.

John

Log-Number: 31820
Date: Tue, 3 Dec 91 23:58:47 PST
From: shirriff (Ken Shirriff)
Subject: Re: makedepend problems

The problem with makedepend is that it would look in the current
directory if you include <foo.h>, but it should only look in the current
directory if you include "foo.h". I've installed a patched makedepend
that should fix this problem. It seems to work, but I wouldn't be
surprised if something else breaks.

Ken

Log-Number: 31811
Date: Mon, 2 Dec 91 16:11:25 PST
From: shirriff (Ken Shirriff)
Subject: Lust network deadlock

Lust died:

resetting network interface PMAD-AA Deferring reset
LE ethernet: Deadlock (*) @0x8011c5e0
HolderPC: 0x800ae7e0 currentPC 0x800ac9b8
Entering debugger at 800e506c

(Note: (*) is N with a tilde above it; I don't know why that was in the
error message.)

Log-Number: 31812
Date: Mon, 2 Dec 91 16:33:20 PST
From: mendel (Mendel Rosenblum)
Subject: Re: Lust network deadlock

> LE ethernet: Deadlock (*) @0x8011c5e0
> HolderPC: 0x800ae7e0 currentPC 0x800ac9b8
> Entering debugger at 800e506c

The problem is that NetLEOutput() calls NetLERestart() if the call to
OutputPacket() returns an error. Since both NetLEOutput() and
NetLERestart() start by grabbing the master lock on the interface we get
a deadlock.
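The self-deadlock described here can be illustrated outside the kernel.
What follows is a minimal sketch and not the Sprite source: it assumes a
POSIX error-checking mutex as a stand-in for the interface's master lock,
and the names master_lock, output_buggy, output_fixed, restart and
restart_locked are all hypothetical. The error-checking mutex is used
only so that the second lock attempt reports EDEADLK instead of hanging.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the per-interface master lock. */
static pthread_mutex_t master_lock;
static int restart_count = 0;

/* Create the lock as an error-checking mutex, so a relock by the
 * owning thread returns EDEADLK rather than blocking forever. */
static int master_lock_init(void)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0) {
        return rc;
    }
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (rc == 0) {
        rc = pthread_mutex_init(&master_lock, &attr);
    }
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Restart work proper; the caller must already hold master_lock. */
static void restart_locked(void)
{
    restart_count++;
}

/* Public restart entry point: takes the lock itself. */
static int restart(void)
{
    int rc = pthread_mutex_lock(&master_lock);
    if (rc != 0) {
        return rc;              /* EDEADLK if the caller holds the lock */
    }
    restart_locked();
    pthread_mutex_unlock(&master_lock);
    return 0;
}

/* Buggy shape: holds the lock, then calls the locking entry point. */
static int output_buggy(int send_failed)
{
    int rc = pthread_mutex_lock(&master_lock);
    if (rc != 0) {
        return rc;
    }
    if (send_failed) {
        rc = restart();         /* second lock by the same thread */
    }
    pthread_mutex_unlock(&master_lock);
    return rc;
}

/* One possible fix: call the variant that expects the lock held. */
static int output_fixed(int send_failed)
{
    int rc = pthread_mutex_lock(&master_lock);
    if (rc != 0) {
        return rc;
    }
    if (send_failed) {
        restart_locked();
    }
    pthread_mutex_unlock(&master_lock);
    return 0;
}
```

After master_lock_init(), output_buggy(1) returns EDEADLK while
output_fixed(1) completes the restart. With an ordinary lock the buggy
path would hang instead of returning an error. Splitting the restart into
a locked and an already-locked variant is only one common remedy, and is
an assumption about how a fix might look, not a description of the change
actually made to the net module.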
Mendel Log-Number: 31813 Subject: load on allspice Date: Mon, 02 Dec 91 18:20:02 PST From: Mike Kupfer <kupfer> Well, allspice is getting the stuffing beaten out of it: the load is around 13, and it's impossible to log on because of the lack of response at the console. (1) I saw a lot of "reinit rcv unit" messages on allspice's console. Isn't there a buffer limit or something that we can raise to make this problem go away? (My understanding is that reinitializing the net causes positive feedback--i.e., makes the network load worse--because it locks out communication, which leads to retransmissions, so this seems like something we should try to fix.) (2) /user6 and, to a lesser extent, /swap1 are getting used heavily. Who's using them? Beats me--the RPC trace is turned off by default, and you have to log in to turn it on. Either we should leave it on by default, or there should be an L1 command to toggle it on/off. (Also, L1-z should report whether it's on or off.) (3) L1-r showed a bunch of sendmails. From looking at the mail queue, I imagine they're the sendmails that are processing tcl mailing list messages. The current sendmail config file will start up new sendmails until the load gets to 8. Should we make this smaller? (I think the sendmail load is small compared to the RPC load, though.) mike [5-Dec-1991: re: (1), from discussion at the Sprite meeting: it's not practical to raise the number of buffers to the point where it would actually make a difference. Also, "reinit rcv unit" isn't as expensive as resetting the entire network interface. 
-mdk] Log-Number: 31814 Date: Mon, 2 Dec 91 22:00:03 PST From: shirriff (Ken Shirriff) Subject: Mail problem I tried sending myself mail from shallot to see if mail is getting through and I got the following reply: |From MAILER-DAEMON@shallot.berkeley.edu Mon Dec 2 21:52:56 1991 |Date: Mon, 2 Dec 91 21:52:46 PST |From: MAILER-DAEMON@shallot.berkeley.edu (Mail Delivery Subsystem) |Subject: Returned mail: Remote protocol error |To: shirriff@shallot.berkeley.edu | ----- Transcript of session follows ----- |554 shirriff@sprite... Remote protocol error | ----- Unsent message follows ----- |Received: by shallot.Berkeley.EDU (4.1/1.42) | id AA23026; Mon, 2 Dec 91 21:52:46 PST |Date: Mon, 2 Dec 91 21:52:46 PST |From: shirriff (Ken Shirriff) |Message-Id: <9112030552.AA23026@shallot.Berkeley.EDU> |To: shirriff@sprite |Subject: test I don't know what a remote protocol error is. I received the above bounce message successfully on sprite. I sent another message and it got through. There are a couple messages people have sent me that haven't gotten through. Ken Log-Number: 31817 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Tue, 3 Dec 1991 16:32:05 PST Subject: allspice crash due to rename Allspice crashed earlier today due to a bug in rename. Actually, the rename problem caused it to hang RPCs. L1-i finished it off. I then managed to wedge lust and allspice trying to fix the bug. Sorry about that. The bug occurs when you try to rename a file to a link to itself. For example, imagine that "a" and "b" refer to the same file and you do rename("a", "b"). The server will try to lock both handles, which are really the same handle and will deadlock. I'm working on a fix. John Log-Number: 31818 From: rab (Robert A. Bruce) Subject: ethernet chip loop Date: Tue, 03 Dec 91 18:02:25 PST The ethernet chip on sassafras is caught in a loop. The console keeps printing this message over and over: LE ethernet: Bogus receive interrupt. Buffer 0xfffc0078 owned by cheip. 
Entering debugger with a Interrupt Trap (16) exception at PC 0xf60b6354 LE ethernet: Missed a packet. LE ethernet: Bogus receive interrupt. Buffer 0xfffc0078 owned by cheip. Entering debugger with a Interrupt Trap (16) exception at PC 0xf60b6354 LE ethernet: Missed a packet. LE ethernet: Bogus receive interrupt. Buffer 0xfffc0078 owned by cheip. Entering debugger with a Interrupt Trap (16) exception at PC 0xf60b6354 LE ethernet: Missed a packet. ... It will not respond to L1-D or L1-A, so I am going to do a power-cycle. -bob Log-Number: 31819 Date: Tue, 3 Dec 91 21:26:10 PST From: shirriff (Ken Shirriff) Subject: Allspice wedged Allspice was totally wedged and wouldn't respond to L1-anything. I did a reset and continue; it printed a few lines and then crashed uninformatively. So I rebooted. Log-Number: 31825 Date: Thu, 5 Dec 91 06:19:17 PST From: voelker (Geoffrey M. Voelker) Subject: allspice Just before 6:00 this morning allspice hung with a screenful of recv unit reinitializations, and RPCs were hanging to Lust. I put allspice into monitor and then continued it, and it tried to recover. In doing so it rebooted and eventually came back up. Lust seemed fine once allspice was. -geoff Log-Number: 31826 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Thu, 5 Dec 1991 11:30:51 PST Subject: atof can't handle MAXDOUBLE The atof function overflows on the MAXDOUBLE value as defined in values.h I got this value off of SunOS where atof seems to work ok. Ours returns Inf. If your program uses MAXDOUBLE then gcc can't compile it because it can't read the number. John Log-Number: 31902 Subject: Sprite log munged Date: Thu, 05 Dec 91 12:33:14 PST From: Mike Kupfer <kupfer> The sequence number for the Sprite logger apparently got munched, so messages starting from last Friday were getting numbered starting from 1. I've renumbered those messages and reset the logger sequence number (there will be a gap between the renumbered messages and new messages). 
Action items: (1) should we restore the 31 log messages (from 1989) that got overwritten? (2) the logger should be taught not to overwrite existing files. mike Log-Number: 31907 Subject: "wall" hung my machine (whining) Date: Thu, 05 Dec 91 18:22:25 PST From: Mike Kupfer <kupfer> I just had to reboot sage because almost all my RPC channels were stuck with hung "device open"s. Grumble. mike Log-Number: 31909 Date: Thu, 5 Dec 91 23:37:20 PST From: shirriff (Ken Shirriff) Subject: New ipServer I've installed a new ipServer that doesn't die if you do a recvfrom with a bad address. The recvfrom ends up getting an IO error. (The ipServer has about 20 different exits; I bet we could improve its reliability a lot by just removing them all. I'd think that the ipServer shouldn't exit under normal conditions.) Ken Log-Number: 31910 Date: Fri, 6 Dec 91 01:58:46 PST From: voelker (Geoffrey M. Voelker) Subject: allspice Allspice went down again around 1:00 AM. Its console had a stream of messages that said: "fscacheGetDirtyFile skipping deleted file <0,62843> "58"". I put it into the monitor and continued it, but the messages continued to stream up the console. So I rebooted it, but it "found error in file desciptor map" and initiated a reboot itself. But the disk with /allspiceA was whirring like mad and allspice started to report scsi bus busy errors, so I went back into monitor and rebooted again. (I booted sd()sprite to see if the situation would improve any by using the old kernel...sorry if that was the wrong thing to do.) Oh, Lust had entered the debugger when allspice was initially hung. I booted it off of its disk. -geoff Log-Number: 31913 Date: Fri, 6 Dec 91 12:43:49 PST From: soumen@sprite.Berkeley.EDU (SOUMEN CHAKRABARTI) Subject: more about arson a typical session at arson ... [Sprite:arson72] pwd;ls /sprite/cmds/pwd: invalid argument. /sprite/cmds/ls: invalid argument. [Sprite:arson73] cd [Sprite:arson74] pwd /sprite/cmds/pwd: invalid argument. 
[Sprite:arson75] clear /sprite/cmds/clear: invalid argument. [Sprite:arson76] !74 pwd /sprite/cmds/pwd: invalid argument. [Sprite:arson77] clear /sprite/cmds/clear: invalid argument. [Sprite:arson78] arson, that's what i feel like ... Log-Number: 31918 Date: Fri, 6 Dec 91 13:15:10 PST From: pmchen (Peter M. Chen) Subject: two day mail delay? Check out this mail--it was sent Wed 11am; I received it Fri 9am. Pete >From bks@okeeffe.CS.Berkeley.EDU Fri Dec 6 08:38:56 1991 >Received: from okeeffe.CS.Berkeley.EDU by sprite.Berkeley.EDU (5.59/1.29) > id AA855619; Fri, 6 Dec 91 08:38:55 PST >Received: by okeeffe.CS.Berkeley.EDU (5.79/1.42) > id AA22313; Wed, 4 Dec 91 10:43:37 -0800 >Date: Wed, 4 Dec 91 10:43:37 -0800 >From: bks@okeeffe.CS.Berkeley.EDU (Brian K. Shiratsuki) >Message-Id: <9112041843.AA22313@okeeffe.CS.Berkeley.EDU> >To: pmchen@sprite.Berkeley.EDU >In-Reply-To: Peter M. Chen's message of Tue, 3 Dec 91 22:45:15 PST <9112040645.A >A142422@sprite.Berkeley.EDU> >Subject: QIC tape I gave you >Reply-To: bks@vangogh.CS.Berkeley.EDU Log-Number: 31926 From: mgbaker (Mary Gray Baker) Subject: mail undeliverable Date: Sun, 08 Dec 91 16:32:59 PST I don't know if this is on our end for certain, or if perhaps it could have been a problem also on the postgres end, but several pieces of mail that Margo tried to send to me from postgres were returned to her as undeliverable after 3 days. Mary Log-Number: 31920 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Fri, 6 Dec 1991 14:16:33 PST Subject: compat problem The "compress" program in /sprite/cmds.$MACHINE.compat has a problem. The following output is from a ds5000 running 1.106. 
John hijack<jhh 2> touch reallyLongFileName hijack<jhh 3> compress !$ compress reallyLongFileName reallyLongFileName: filename too long to tack on .Z hijack<jhh 4> which compress /sprite/cmds.$MACHINE.compat/compress hijack<jhh 5> /sprite/cmds/compress reallyLongFileName hijack<jhh 6> Log-Number: 31923 Date: Fri, 6 Dec 91 15:48:29 PST From: mottsmth (Jim Mott-Smith) Subject: Interesting X behavior. Just for the record (though it probably won't make the top-10-to-fix list): If I run Sprite's native xdvi and pop the window up on a SunOS machine, the window appears and then instantly disappears leaving a cryptic error message. On occasion I can actually press a button or two before the image vanishes. (If I run the sww version of xdvi this does not happen.) Typical error messages: X Error of failed request: BadLength (poly request too large or internal Xlib length error) Major opcode of failed request: 96 (X_RecolorCursor) Minor opcode of failed request: 0 Resource id in failed request: 0x800030 Serial number of failed request: 1039 Current serial number in output stream: 1229 X Error of failed request: BadGC (invalid GC parameter) Major opcode of failed request: 72 (X_PutImage) Minor opcode of failed request: 0 Resource id in failed request: 0x0 Serial number of failed request: 2110 Current serial number in output stream: 2190 -- Jim M-S Log-Number: 31924 Subject: info on 26 Nov. allspice hang Date: Fri, 06 Dec 91 18:57:58 PST From: Mike Kupfer <kupfer> I poked around in the core file that Bob made from Allspice on the 26th. (The core file is /home/ginger/cores/allspice.hung.Nov.26, and the kernel is 1.106.) I built up a short table that seems to show what was going on (see below). Basically, lots of processes got stuck waiting for the fsioStream.c monitor lock (lock 0xf60f4840). This was held by a process that was waiting for the lock on a cache block. The holder of the cache block lock was waiting for an I/O operation on the block to complete. 
Some notes: (1) the process holding the fsioStream.c monitor lock (process 2a) was doing an open of /dev/null. If I can trust the attributes that I found while poking around in its stack, the permissions on /dev/null at this time were already screwed up (0640). (2) Mayhem was rebooted the next morning. Is it possible that it hung and then caused the operation on the cache block to hang? mike ----- pcb waiting for comments --- ----- ----- 0 1 0xf60f4840 (lock) (rpc server) 2 0xf6448fc4 (cond) (rpc server) (reading swap file(?) for mayhem; waiting for busy block to complete I/O operation) 3 (Rpc_Daemon) 4 5 6 (idle server proc) 7 (idle server proc) 8 (idle server proc) 9 (idle server proc) a (idle server proc) b (idle server proc) c (idle server proc) d 0xf61a2a40 (cond) (Recov Proc) e 0xf61c9f20 (cond) (user Proc_Wait) f 0xf60f4840 (lock) (rpc server) (waiting for fsioStream monitor lock, held by process 2a) 10 0xf60f4840 (lock) (rpc server) (waiting for fsioStream monitor lock) 11 0xf60f4840 (lock) (rpc server) (waiting for fsioStream monitor lock) 12 0xf6196458 (cond) (sig pause) 13 0xf60f4840 (lock) (rpc server) (waiting for fsioStream monitor lock) 2a 0xf6be53d0 (lock) (rpc server) (opening /dev/null) (waiting for cache block lock, held by process 2) Log-Number: 31927 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Mon, 9 Dec 1991 10:56:55 PST Subject: locked cache block Allspice currently has a block in its cache that is locked due to IO but the IO never completes (I let it sit over night). The blockPtr flags don't indicate that there is an IO in progress, but a process is still waiting on the ioDone condition. The block happens to be in a swap file for loiter, so loiter wedged up. I moved all of its swap files to /lost+found so loiter would reboot, but there are still 5 stuck rpc servers on allspice. It's probably not worth rebooting immediately, but weren't we going to reboot it to bring up the 1.106 kernel anyway? 
The file was in /swap1, which is an LFS. We've seen these "lost io"
problems before on RAID-I but attributed them to the raid driver or
hardware. The stuck process on allspice is entry 0x21 in the table.

John

Log-Number: 31941
Date: Wed, 11 Dec 91 17:11:21 PST
From: shirriff (Ken Shirriff)
Subject: Re: bugs in compat version of grn

The problem with the compat version of grn was that grn didn't have a
declaration of "double atof()", so the compiler messed up the conversion.
I added "#include <stdlib.h>" and now it works. I have no idea why grn
ever worked, since a recompiled non-compat version didn't work either;
our include files must have changed around recently.

Ken

Log-Number: 31930
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Tue, 10 Dec 1991 10:42:25 PST
Subject: allspice consistency problem

When I came in this morning allspice was barely alive. Its process table
was full of sendmail processes. I took a core (allspiceOutOfProcesses)
and discovered that all the sendmails are waiting for consistency to
finish so that they may start consistency callbacks themselves.
Unfortunately it doesn't look like they ever make progress.

Begin whining

It would be nice if someone would fix the kgdb/tty stuff so it would be
usable if I'm logged into another machine. The dropping of characters is
really annoying. Also, sometimes if I type ^C because I don't want to see
the rest of a listing it drops me back to the shell, leaving my kgdb
process detached. This is annoying also. Perhaps ^C isn't the right thing
to type? And last but not least, it would be nice if the Fsconsist_Info
structure contained an indication of which process is currently doing the
consistency stuff so one could figure out why it isn't making progress.
Searching through 128 processes isn't much fun.

John

Log-Number: 31931
Date: Tue, 10 Dec 91 12:35:07 PST
From: pmchen (Peter M.
Chen) Subject: allspice disk full messages I'm getting lots of these messages: 12/10/91 12:33:16 allspice (14) RmtFile "/sprite/spool/mail/randy" <10,2227> Write-back failed: out of disk space<40008> but df reports lots of room on / mustard% df /sprite/spool/mail/randy Prefix Server KBytes Used Avail % Used / allspice 495968 435951 10420 97% And /sprite/spool/mail/randy is not that big ls -l /sprite/spool/mail/randy -rw------- 1 randy 239499 Dec 10 12:31 /sprite/spool/mail/randy I did just send mail to the raid alias (which includes randy). This was on mustard, ds5000, running 1.106. Pete Log-Number: 31934 Date: Wed, 11 Dec 91 12:04:22 PST From: mendel (Mendel Rosenblum) Subject: Re: new ds3100 c compiler While compiling the lfs kernel module on loiter: --- ds3100.md/lfsSegUsage.o --- cc -g3 -O -DKERNEL -Dds3100 -Dsprite -Uultrix -Ids3100.md -I. -I/sprite/src/kernel/Include/ds3100.md -I/sprite/src/kernel/Include -I/sprite/lib/include/ds3100.md -I/sprite/lib/include -c lfsSegUsage.c -o ds3100.md/lfsSegUsage.o --- ds3100.md/lfs.o --- rm -f ds3100.md/lfs.o ld -r -L/sprite/lib/ds3100.md ds3100.md/lfsBlockIO.o ds3100.md/lfsCacheBackend.o ds3100.md/lfsDesc.o ds3100.md/lfsDescMap.o ds3100.md/lfsDirOpLog.o ds3100.md/lfsFileIndex.o ds3100.md/lfsFileLayout.o ds3100.md/lfsIo.o ds3100.md/lfsLoad.o ds3100.md/lfsMain.o ds3100.md/lfsMem.o ds3100.md/lfsSeg.o ds3100.md/lfsSegUsage.o ds3100.md/lfsStableMem.o -o ds3100.md/lfs.o ld: ds3100.md/lfsBlockIO.o: version stamp: 2.10, does not match ld's: 1.31 ds3100.md/lfsCacheBackend.o: version stamp: 2.10, does not match ld's: 1.31 ds3100.md/lfsDesc.o: version stamp: 2.10, does not match ld's: 1.31 ds3100.md/lfsDescMap.o: version stamp: 2.10, does not match ld's: 1.31 ds3100.md/lfsDirOpLog.o: version stamp: 2.10, does not match ld's: 1.31 ds3100.md/lfsFileIndex.o: version stamp: 2.10, does not match ld's: 1.31 ds3100.md/lfsFileLayout.o: version stamp: 2.10, does not match ld's: 1.31 ds3100.md/lfsIo.o: version stamp: 2.10, does not match 
ld's: 1.31 ds3100.md/lfsLoad.o: version stamp: 2.10, does not match ld's: 1.31 ds3100.md/lfsMain.o: version stamp: 2.10, does not match ld's: 1.31 ds3100.md/lfsMem.o: version stamp: 2.10, does not match ld's: 1.31 ds3100.md/lfsSeg.o: version stamp: 2.10, does not match ld's: 1.31 ds3100.md/lfsSegUsage.o: version stamp: 2.10, does not match ld's: 1.31 ds3100.md/lfsStableMem.o: version stamp: 2.10, does not match ld's: 1.31 loiter% Mendel Log-Number: 31935 Subject: VmListInsert: Inserting element twice Date: Wed, 11 Dec 91 12:59:50 PST From: Mike Kupfer <kupfer> Coons crashed with this message when it tried to evict a bunch of processes. It was running a private kernel, but the only difference between it and 1.106 is some additional printf's. I poked around with kgdb, but wasn't able to get much information because of the problem with local variables being displayed with bogus values. The stack trace is below. The one thing I noticed is that in VmPageFreeInt, the call to PutOnFreeList is in the first branch of the "if"--that is, the page belongs to the kernel. 
mike -- #0 panic (va_alist=-2146252404) (sysPrintf.c line 220) 220 Dev_SyslogDebug(FALSE); #1 0x800eec84 in VmListInsert (itemPtr=(struct List_Links *) 0x8023c978, destPtr=(struct List_Links *) 0x8023c978) (vmList.c line 53) #2 0x800f0790 in PutOnFreeList (corePtr=(struct VmCore *) 0x80195e80) (vmPage.c line 459) #3 0x800f1758 in VmPageFreeInt (pfNum=3357785016) (vmPage.c line 1297) #4 0x800eff2c in PrepareSegment (segPtr=(struct Vm_Segment *) 0x8023bdfc) (vmMigrate.c line 598) #5 0x800ef558 in Vm_InitiateMigration (procPtr=(struct Proc_ControlBlock *) 0xc0311594, hostID=88, infoPtr=(struct .F77 *) 0xc823bcf8) (vmMigrate.c line 106) #6 0x800c21ec in Proc_MigrateTrap (procPtr=(struct Proc_ControlBlock *) 0xc0311594) (procMigrate.c line 544) #7 0x800df0e4 in Sig_Handle (procPtr=(struct Proc_ControlBlock *) 0xc0311594, sigStackPtr=(struct .F59 *) 0xc823be2c, pcPtr=(char **) 0xc823be28) (signals.c line 1230) #8 0x80037b88 in MachUserReturn (procPtr=(struct Proc_ControlBlock *) 0xc0311594) (ds5000.md/machCode.c line 1540) #9 0x80035c0c in MachSysCall () (ds5000.md/machAsm.s line 1679) Log-Number: 31942 Subject: new mips cc doesn't understand -m Date: Wed, 11 Dec 91 18:42:50 PST From: Mike Kupfer <kupfer> The 1.31 cc understood (or ignored) "-mds5000", which my top-level kernel Makefile was putting in. It would be nice if the new cc would also accept -m<machine>, but I suppose it's not a big deal. mike Log-Number: 31943 Subject: lust deadlock Date: Wed, 11 Dec 91 19:16:28 PST From: Mike Kupfer <kupfer> Lust crashed with a deadlock in the net module. It was running the 1.106 kernel, and the console message said holder PC 800ae7e0, current PC 800ac9b8 The first address is in NetLEOutput, the second one is in NetLERestart. mike Log-Number: 31951 Date: Fri, 13 Dec 91 16:41:40 PST From: pmchen (Peter M. Chen) Subject: segmentation fault on ds5000 On mustard (ds5000) running 1.106, the following program gives a segmentation violation. 
Apparently this has something to do with memory allocation, since if I
change the definition of MAXRESULTNUM to 9000, it works fine. This
happened on clove also. But, on jaywalk (sun4), it works fine.

sizeof(WLRESULT) = 232
sizeof(WLSIMPLERESULT) = 24

This is 2.5 MB of memory.

Pete

-------------------------------------------------------
#include <stdio.h>

typedef struct {
    double throughput;        /* throughput in MB/s */
    double avResponseTime;    /* average response time in ms */
    double cpuThink;          /* cpu think time in ms */
} WLPERF;

typedef struct {
    char *dirName;            /* what directory the test files go in */
    unsigned int uniqueBytes; /* total number of unique bytes touched in
                                 this run */
    double reUse;             /* how many times each byte gets accessed */
    double hitDepth;          /* average LRU depth (fraction of
                                 uniqueBytes) */
    double readProb;          /* fraction of reads */
    unsigned int sizeMean;    /* average request size in bytes */
    double sizeCVar;          /* coefficient of variation of request size */
    int processNum;           /* how many processes */
    double sharing;           /* how much sharing there is */
    double cpuThink;          /* cpu think time between I/O's (ms) */
    double seqProb;           /* probability of sequentiality */
    unsigned int alignQuanta; /* quanta of alignment for most requests */
    double alignProb;         /* probability that a request is aligned */
    int seed;                 /* random seed */
} WLPARAM;

typedef struct {
    WLPARAM param;            /* parameters achieved */
    WLPARAM target;           /* parameters fed to specWl to shoot for */
    WLPERF perf;              /* performance achieved */
} WLRESULT;

#define MAXRESULTNUM 10000    /* maximum number of results we can get */

typedef struct {
    double paramValue;
    double performance;
    char *comment;
} WLSIMPLERESULT;

main()
{
    WLRESULT previousResult[MAXRESULTNUM];
    WLSIMPLERESULT simplePreviousResult[MAXRESULTNUM];

    printf("hi\n");
}

Log-Number: 31952
Date: Fri, 13 Dec 91 17:21:15 PST
From: mendel (Mendel Rosenblum)
Subject: Re: segmentation fault on ds5000

The problem is in the Sprite kernel or in your program.
The Sprite kernel limits the growth of the stack segment to 2 megabytes
at a time on the decStations and 8 megabytes at a time on the
sparcStations. Most programs don't put so much on the stack. Declaring
your variables to be static will get around this problem (if the routine
is not recursive).

The reason for the stack limit is that the stack (unlike the heap) is
grown automatically when you reference a variable outside its range. The
stack starts at the top of memory and grows down towards the heap. Like:

+-------------+
|    Stack    |
|      |      |
|      V      |
+-------------+
|             |
|             |
|             |
+-------------+
|      ^      |
|      |      |
|    Heap     |
+-------------+

The question becomes what happens if a reference occurs one byte past
the end of the heap? Is this someone addressing off the end of the heap
or is it the stack growing a large amount? The answer in Sprite (and
Unix) is to limit the growth of the stack.

The reason it is different on the sun4 and ds5000 is that someone had a
program that used more than 2 and less than 8 so they changed it.

Mendel

Log-Number: 31953
Subject: the second reason for pmake hangs
Date: Mon, 16 Dec 91 15:02:45 PST
From: Mike Kupfer <kupfer>

When pmake wants to re-export a job, it runs through the PCB table to
find the processes that need restarting. Unfortunately, Proc_GetPCBInfo
only returns information about local processes. So if pmake itself is
migrated, it won't see children that are running on another machine (or
migrated home).

There are an ifdef and comments in Proc_GetPCBInfo that suggest that this
behavior is deliberate. Does anyone know the rationale? Assuming that
Proc_GetPCBInfo's behavior is correct, the answer is for pmake to query
the home machine, not the machine it happens to be running on.

mike

Log-Number: 31954
From: jhh@sprite.Berkeley.EDU (John H.
Hartman)
Date: Mon, 16 Dec 1991 23:01:24 PST
Subject: gdb and new mips cc

When I run Kgdb on the ds5000.1.107 kernel I get lots of the following
message:

[Unimplemented kind of type: 26]

I assume it's related to the new cc. Kgdb seemed to work ok after all the
complaining, but I'm sure something is broken.

John

Log-Number: 31955
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Mon, 16 Dec 1991 23:04:25 PST
Subject: hijack crash

Hijack ran out of memory while running the dumps (tar.gnu) and xmh. I had
to leave so I didn't look at it too long. Here are some interesting
numbers from the memory module.

John

Total allocs = 4886610, frees = 4863301

Small object allocator:
   Size     Total    Allocs    In Use
     24      1980   1633682       390
     32      1084    353007       699
     40      7564   1393474      7535
     48      6236    371588      5269
     56      3020    282720      2082
     64      1628    105663       719
     72      1244    220015       138
     80      2908     25885      1829
     88      1788    183777       437
     96       268      2522        45
    104       204      1033        62
    112        28        97         3
    120        12     27945         8
    128        12        50         2
    136        12        46         4
    144        12       173         3
    152         4        89         0
    160        12        64         7
    168         4        35         2
    176        28        44        15
    184         4        94         0
    192         4        32         1
    200        12        13         2
    208         4         4         0
    216        92       172        83
    224         4       605         0
    232        76       683        70
    240         4        16         0
    248         4         4         0
    256         4        11         0
    264         4         5         1
    280        76      2105        68
    336      3756    143600      3712
   4112        12      1491         3
  Total     32104   4750744     23189
Bytes allocated = 2881312, freed = 606928

Large object allocator:
Total bytes managed: 1571168
Bytes in use: 609312
Orig.
Size Num Free In Use 1824 2 0 2 208 2 2 0 1048592 1 0 1 328 3 1 2 272 77 4 73 256 1 1 0 400 1 0 1 520 1 1 0 528 3 0 3 1016 2 0 2 216 1 1 0 1536 20 4 16 496 1 1 0 552 2 0 2 64 1 1 0 344 1 1 0 568 2 1 1 864 1 1 0 56 1 1 0 224 1 1 0 240 6 6 0 512 8 8 0 680 1 1 0 760 1 0 1 48 1 1 0 944 1 1 0 544 1 1 0 416 1 1 0 144 1 1 0 656 1 1 0 288 1 1 0 2032 1 1 0 560 2 1 1 1208 1 1 0 776 1 1 0 2064 1 0 1 1736 1 1 0 304 1 1 0 232 1 1 0 728 2 1 1 2400 1 1 0 408 1 0 1 16 1 1 0 392 1 0 1 120 1 1 0 296 2 0 2 1808 1 0 1 4336 1 1 0 6024 1 1 0 8128 1 1 0 4064 1 1 0 2352 1 1 0 4968 2 2 0 2304 1 1 0 5872 1 1 0 368 1 0 1 4368 1 1 0 49168 2 0 2 6040 1 1 0 360 1 0 1 7392 1 1 0 16384 1 1 0 41208 1 1 0 23256 1 1 0 44200 1 1 0 20808 1 1 0 65552 1 1 0 1296 1 0 1 504 1 1 0 960 1 1 0 1032 1 1 0 2104 3 2 1 440 1 0 1 936 1 1 0 72 1 1 0 1048 2 2 0 5616 1 1 0 576 1 0 1 36328 1 1 0 Log-Number: 31956 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Tue, 17 Dec 1991 17:20:57 PST Subject: new pscomm I've installed a new pscomm that fixes a bug parsing status messages from the printer. If the job changes its name then the printer sends back the job name at the beginning of the status message. Pscomm was a little too simplistic in its parsing of status messages so it would get confused. Eventually it would exit, causing another one to be started which would have the same problem. Most jobs don't change the jobname, so this bug never came up before. I tried to print something from PageMaker, and it just sat there forever. John Log-Number: 31959 Date: Wed, 18 Dec 91 13:58:51 PST From: mottsmth (Jim Mott-Smith) Subject: compatibility problem with dvi2x /usr/sww/bin/dvi2x runs on Sabotage but segment faults on Covet. -- Jim M-S Log-Number: 31962 Date: Thu, 19 Dec 91 13:48:27 PST From: pmchen (Peter M. Chen) Subject: compiling with profiling ld complains about a malformatted header of archive member in the C profiling library. This was on sabotage, sun4/75 running 1.107. 
Pete

In directory ~pmchen/bench/specWl
cc -g -pg -o specWl obj/addressTree.o obj/adjustTarget.o obj/commDoWl.o obj/createGarbage.o obj/fileSize.o obj/getLoc.o obj/getParamSpecWl.o obj/lruTree.o obj/main.o obj/mathdist.o obj/option.o obj/optionInit.o obj/param.o obj/sizefunctions.o obj/specWl.o obj/specWlIO.o obj/tree.o -lm
ld: malformatted header of archive member in /sprite/lib/sun4.md/libc_p.a

Log-Number: 31963
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Thu, 19 Dec 1991 15:16:30 PST
Subject: mips gdb complaining

When gdb or kgdb is run on things compiled with the new cc you'll get lots of messages that say:

[Unimplemented kind of type: 26]

These are caused by void types in the symbol table. The new compiler generates them and the current gdb/kgdb doesn't understand them. I think they work ok, they just complain a bit. We should probably roll forward to gdb 4.X.

John

Log-Number: 31964
From: rab (Robert A. Bruce)
Subject: Re: mips gdb complaining
Date: Thu, 19 Dec 91 15:37:48 PST

If there is no other reason to roll forward to the next gdb, it is a simple hack to make the current gdb ignore this particular symbol type.

-bob

Log-Number: 31967
Date: Thu, 19 Dec 91 17:09:54 PST
From: shirriff (Ken Shirriff)
Subject: Re: compat problem with compress

I found the problem with compatibility compress not accepting long names. Apparently compatibility compress was compiled without the -DBSD4_2 flag which is in the local.mk; this flag allows long filenames.

Ken

Log-Number: 31972
Date: Thu, 19 Dec 91 23:24:45 PST
From: shirriff (Ken Shirriff)
Subject: Re: arson out of processes

The reason arson said "invalid argument" when it ran out of processes is that the vm module ran out of segments and returned VM_NO_SEGMENTS. In its infinite wisdom, Compat_MapCode converted this to "invalid argument". I've changed it so it returns "not enough memory" and installed a new csh.

Ken

Log-Number: 31974
Date: Fri, 20 Dec 91 20:20:18 PST
From: voelker (Geoffrey M.
Voelker)
Subject: allspice crash

Allspice went into an LFS cleaning frenzy again, this time on /local. The segment numbers were not diverging, but neither were they converging.

I also made an attempt at cleaning allspice's console. It was quite dirty.

(I'm sorry about not core dumping allspice; I still have yet to make the effort to get an account on ginger)

-geoff

Log-Number: 31975
Date: Fri, 20 Dec 91 21:30:44 PST
From: voelker@miro.Berkeley.EDU (Geoffrey Voelker)
Subject: allspice II

>a sample of its output. The more info you give us the more
>likely we'll be able to find the bug.

Ooops. I should have included this in the first message. Allspice's console was covered with something like:

/local: Cleaning started.
Cleaned 4809 segments in 4410 segments.
(something something) -- deficit 48 segments

The increment in the number of segments it cleaned was also the increment in the number of segments it wrote out.

-geoff

Log-Number: 31976
Date: Mon, 23 Dec 91 15:02:13 PST
From: ouster (John Ousterhout)
Subject: Allspice reboot

I'm responsible for Allspice's reboot this afternoon. Sendmail was hung up, and when I went up to restart the IPserver I discovered that Allspice was in an infinite cleaning loop on /local (it seemed to continually run a deficit of 44 segments). I just rebooted.

-John-

Log-Number: 31977
Date: Tue, 24 Dec 91 08:39:18 PST
From: ouster (John Ousterhout)
Subject: Allspice crash: /local corrupted

Allspice crashed this morning with the following error message:

Bad segment summary magic in segment 554
Corrupted segment summary block

In my haste to get Allspice back running again I forgot to take a core dump. I have a feeling this bug may repeat every time Allspice tries to clean /local, in which case there will be more opportunities for core dumps. If it does repeat, I'd suggest removing /local from the mount list until consistency can be restored.

-John-

Log-Number: 31979
Subject: "dup2: invalid argument" still a problem
Date: Mon, 30 Dec 91 13:05:37 PST
From: Mike Kupfer <kupfer>

I'm still getting "dup2: invalid argument" and no wall message when I invoke shutdown. This happens even if I allow a long time before shutdown (e.g., "shutdown -S 300"). It only seems to be happening with /sprite/cmds.compat/shutdown.

mike

Log-Number: 31980
Subject: R5 bell doesn't work on sun4
Date: Mon, 30 Dec 91 15:33:08 PST
From: Mike Kupfer <kupfer>

The bell doesn't seem to work with X11 R5 running on sun4's. It works okay on sun3's and DECstations. The currently installed R5 Xsun was built without debugging symbols, so the first thing to do is rebuild Xsun with -g turned on.

mike